Apoorv Naik created ATLAS-2117:
----------------------------------

             Summary: Titan Indexer tokenization issues
                 Key: ATLAS-2117
                 URL: https://issues.apache.org/jira/browse/ATLAS-2117
             Project: Atlas
          Issue Type: Bug
    Affects Versions: 0.8-incubating, 0.9-incubating, 0.8.1-incubating
            Reporter: Apoorv Naik
            Assignee: Apoorv Naik
             Fix For: 0.9-incubating, 0.8.1-incubating, 0.8-incubating


When Solr is used as the indexing backend, string values are tokenized by the 
StandardTokenizerFactory, which treats punctuation and special characters as 
delimiters. As a result, a single value is split into several indexed terms, 
all of which are associated with the same vertex (document).

There is also a LowerCaseFilterFactory, which makes lookups case-insensitive.
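
The effect of tokenizing and lowercasing can be illustrated with a small 
sketch. It is a deliberately crude approximation, not Solr's actual analysis 
chain: it splits on every non-alphanumeric character, whereas the real 
StandardTokenizer follows Unicode word-break rules and keeps some characters 
together. The function names and the sample value are hypothetical.

```python
import re


def approx_index_terms(value):
    """Crude stand-in for a tokenizer plus lowercase filter (an assumption,
    not Solr's real behavior): split on non-alphanumerics, then lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t]


def index_matches(query, stored):
    """True if every term of the query is among the stored value's terms."""
    return set(approx_index_terms(query)) <= set(approx_index_terms(stored))


stored = "Sales-Report@Cluster1"       # hypothetical attribute value
print(approx_index_terms(stored))      # ['sales', 'report', 'cluster1']
print(index_matches("report", stored)) # True, although "report" != stored
```

Under these assumptions, a query for "report" matches a vertex whose 
attribute is "Sales-Report@Cluster1", which is a false positive for an 
exact-match search.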

This schema design doesn't work well for the current basic search enhancement 
(ATLAS-1880), causing a lot of false positives/negatives when querying the 
index.

The workaround/hack is either to filter results in memory when such schema 
violations are found, or to push the entire attribute query down to the graph, 
which might be inefficient and memory-intensive. (The current JIRA will track 
this.)
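
A minimal sketch of the in-memory filtering idea, assuming the index returns 
candidate vertices as plain dictionaries; the function name, the data shape, 
and the sample values are hypothetical and not taken from the Atlas code base. 
The index result is treated as an over-approximation, and an exact, 
case-sensitive comparison is applied afterwards.

```python
def filter_in_memory(candidates, attribute, expected):
    """Keep only candidates whose stored attribute equals `expected` exactly
    (case-sensitive, no tokenization)."""
    return [c for c in candidates if c.get(attribute) == expected]


# Hypothetical hits from a tokenized, lowercased index for the query "sales".
index_hits = [
    {"guid": "1", "name": "Sales-Report"},
    {"guid": "2", "name": "sales"},
    {"guid": "3", "name": "Sales"},
]

exact = filter_in_memory(index_hits, "name", "sales")
print([v["guid"] for v in exact])  # ['2']
```

Filtering of this kind can only drop false positives. Vertices that the index 
never returned cannot be recovered this way, which is where pushing the query 
down to the graph comes in.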

The correct solution would be to re-index the existing data with a schema 
change and drop the code workarounds mentioned above, for better search 
performance. (This should be taken up in a separate JIRA.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
