Github user osma commented on the pull request:
https://github.com/apache/jena/pull/53#issuecomment-99062926
> The multilingual index dynamically manages one index per language. Hence,
> when two identical literals have different languages, they will not be
> stored in the same index.
Ah, I see. But this still doesn't help for cases where there are small
differences between literals within the same language, for example
singular/plural forms that get stemmed by the analyzer, or variations in
capitalization.
> The "hash solution" works fine with a SHA-1. So we have one more
> field per doc, but I don't think it's a problem for the final index size.
> Should I commit it?
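To make the quoted hash idea concrete, here is a minimal sketch of what such a key could look like. It assumes the hash is computed over the quad's graph, subject, predicate, raw literal and language tag; the quoted comment does not say exactly what is hashed, and `QuadKey` and `sha1Hex` are hypothetical names, not jena-text API. Because the hash is taken over the raw literal rather than the analyzed tokens, literals that stem or case-fold to the same terms would still get distinct keys.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of jena-text.
final class QuadKey {
    // Returns a 40-character hex SHA-1 over the given quad components.
    // Assumption: these five strings are what identifies an index entry.
    static String sha1Hex(String graph, String subject, String predicate,
                          String literal, String lang) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (String part : new String[] {graph, subject, predicate, literal, lang}) {
                byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
                // Length prefix keeps component boundaries unambiguous,
                // so ("ab", "c") and ("a", "bc") hash differently.
                md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                md.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

The resulting string could be stored in an extra, non-analyzed field of each document and used as the exact-match term when deleting.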
For me this looks like a sensible solution. But I would love to hear
comments from others, in particular on the next issue:
> I also don't know whether it could disturb the conjunctive stuff.
> However, addEntity interacts with updateEntity, and entries
> already correspond to triples/quads, don't they?
With the current default configuration, yes, jena-text entries correspond
to triples/quads. But with the conjunctive query support, that is no longer the
case. There is a wider issue here - is jena-text primarily an alternative
triple/quad index, or is it actually an entity index that just happens to work
on triples in the default configuration? The latter case makes deletion much
more difficult, as there is no longer a 1:1 mapping between quads and Lucene
documents.
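The deletion problem can be illustrated with a toy in-memory model (plain Java collections, not Lucene or jena-text code; `EntityIndexSketch` and its methods are hypothetical). It assumes one document per subject holding several field values, which is my reading of the entity-style layout. In that layout, removing a quad means editing a document, and the document itself can only be dropped once its last value is gone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of an entity-style index: one "document" per subject.
final class EntityIndexSketch {
    // subject URI -> field name -> literal values
    private final Map<String, Map<String, List<String>>> docs = new HashMap<>();

    void add(String subject, String field, String literal) {
        docs.computeIfAbsent(subject, s -> new HashMap<>())
            .computeIfAbsent(field, f -> new ArrayList<>())
            .add(literal);
    }

    // Deleting one quad removes a single value, not the whole document.
    void delete(String subject, String field, String literal) {
        Map<String, List<String>> doc = docs.get(subject);
        if (doc == null) return;
        List<String> values = doc.get(field);
        if (values == null) return;
        values.remove(literal);               // removes the first matching value
        if (values.isEmpty()) doc.remove(field);
        if (doc.isEmpty()) docs.remove(subject); // drop the doc only when empty
    }

    boolean hasDocument(String subject) {
        return docs.containsKey(subject);
    }
}
```

With one document per quad, by contrast, deletion is a single delete-by-key operation, which is why the 1:1 mapping matters here.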