Github user osma commented on the pull request:
https://github.com/apache/jena/pull/53#issuecomment-97699765
Hi Alexis! Thanks for the code!
This just screams for unit tests that show that it actually works in all
cases...
I'm a bit worried about the BORDER_DELIMITER trick used here. It seems very
fragile. For example, what happens if I have these two triples indexed using
jena-text:
    ex:paris rdfs:label "Paris"@fr .
    ex:paris rdfs:label "Paris"@en .
Suppose the second triple is then deleted. Will both entries be dropped from
the Lucene index?
Similar things may happen for variants that differ only in letter case
(which will be folded to lowercase by most analyzers), or singular/plural forms
that get stemmed into the same base form. These may all cause false matches
when entries are about to be deleted.
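To make the concern concrete, here is a toy model in plain Java. It is not the jena-text or Lucene code from this pull request: it simply assumes that deletion matches entries on the analyzed form of the literal, with a hypothetical `analyze` helper standing in for a lowercasing analyzer. All class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AnalyzedDeleteSketch {
    // Hypothetical index entry: subject URI, literal text, language tag.
    static final class Entry {
        final String uri;
        final String text;
        final String lang;

        Entry(String uri, String text, String lang) {
            this.uri = uri;
            this.text = text;
            this.lang = lang;
        }
    }

    // Stand-in for a lowercasing analyzer; real analyzers may also stem.
    static String analyze(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Removes every entry for the URI whose analyzed text matches,
    // ignoring the language tag. Returns the number of entries removed.
    static int deleteByAnalyzedText(List<Entry> index, String uri, String text) {
        int before = index.size();
        index.removeIf(e -> e.uri.equals(uri)
                && analyze(e.text).equals(analyze(text)));
        return before - index.size();
    }

    public static void main(String[] args) {
        List<Entry> index = new ArrayList<>();
        index.add(new Entry("ex:paris", "Paris", "fr"));
        index.add(new Entry("ex:paris", "Paris", "en"));
        index.add(new Entry("ex:paris", "PARIS", ""));

        // The caller only wants to remove the "Paris"@en entry.
        int removed = deleteByAnalyzedText(index, "ex:paris", "Paris");
        System.out.println(removed + " removed, " + index.size() + " left");
    }
}
```

Under these assumptions the single delete removes all three entries, because they share a URI and the same analyzed text.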
If I had to implement something like this, I'd instead try to store the
original literal in the Lucene index, as intact as possible. In practice, that
means putting the string value and language tag in separate fields that store
the values without tokenizing (as the uri and graph fields currently do). Then
at deletion time it would be easy to query for the exact same entry and remove
it, and only it, from the Lucene index.
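A rough sketch of that idea, again as a plain-Java model rather than real Lucene code. It assumes each entry keeps the literal and its language tag verbatim next to the URI, and that deletion compares all three values exactly. The `Entry` class and `deleteExact` method are hypothetical names. In Lucene terms this would presumably map to untokenized fields (such as `StringField`) and a delete by a conjunction of exact term queries, but that mapping is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class ExactDeleteSketch {
    // Hypothetical index entry that keeps the literal and language tag verbatim.
    static final class Entry {
        final String uri;
        final String text;
        final String lang;

        Entry(String uri, String text, String lang) {
            this.uri = uri;
            this.text = text;
            this.lang = lang;
        }
    }

    // Removes only entries whose URI, literal and language tag all match
    // exactly (case-sensitive, no analysis). Returns the number removed.
    static int deleteExact(List<Entry> index, String uri, String text, String lang) {
        int before = index.size();
        index.removeIf(e -> e.uri.equals(uri)
                && e.text.equals(text)
                && e.lang.equals(lang));
        return before - index.size();
    }

    public static void main(String[] args) {
        List<Entry> index = new ArrayList<>();
        index.add(new Entry("ex:paris", "Paris", "fr"));
        index.add(new Entry("ex:paris", "Paris", "en"));
        index.add(new Entry("ex:paris", "PARIS", ""));

        // Only the "Paris"@en entry should go; the other two stay.
        int removed = deleteExact(index, "ex:paris", "Paris", "en");
        System.out.println(removed + " removed, " + index.size() + " left");
    }
}
```

Here the same delete request removes exactly one entry and leaves the `@fr` and upper-case variants in place.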