[
https://issues.apache.org/jira/browse/LUCENE-8695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771144#comment-16771144
]
Michael Gibney commented on LUCENE-8695:
----------------------------------------
I'll second [~khitrin]; if you're interested, I have pushed a branch that
attempts to address this issue (linked to from
[LUCENE-7398|https://issues.apache.org/jira/browse/LUCENE-7398#comment-16630529])
... feedback/testing welcome!
Regarding storing positionLength in the index -- would there be any interest in
revisiting this possibility
([LUCENE-4312|https://issues.apache.org/jira/browse/LUCENE-4312])? The
branch/patch referenced above currently records positionLength in Payloads.
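To illustrate the payload approach, here is a minimal sketch assuming a
dedicated filter -- the class name and the single-byte encoding are
hypothetical, not the actual branch code:
{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical sketch: copy each token's positionLength into its payload
// so that the graph structure survives indexing.
public final class PositionLengthPayloadFilter extends TokenFilter {
  private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public PositionLengthPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Single-byte encoding for brevity; a real encoding would have to
    // handle positionLength values greater than 127.
    payloadAtt.setPayload(new BytesRef(new byte[] {(byte) posLenAtt.getPositionLength()}));
    return true;
  }
}
{code}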
> Word delimiter graph or span queries bug
> ----------------------------------------
>
> Key: LUCENE-8695
> URL: https://issues.apache.org/jira/browse/LUCENE-8695
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 7.7
> Reporter: Pawel Rog
> Priority: Major
>
> I have a simple phrase query and a token stream that uses
> WordDelimiterGraphFilter, and the query fails to match. I tried different
> configurations of the word delimiter graph filter but could not find one
> that works. I don't know whether the problem is on the word delimiter side
> or on the span queries side.
> The query that is generated:
> {code:java}
> spanNear([field:added,
>   spanOr([field:foobarbaz,
>     spanNear([field:foo, field:bar, field:baz], 0, true)]),
>   field:entry], 0, true)
> {code}
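> For reference, the same query could be built directly against the span API,
> roughly as follows (a sketch for readability; the test below produces it via
> QueryParser instead):
> {code:java}
> // Sketch: the generated query above, constructed by hand.
> SpanQuery parts = new SpanNearQuery(new SpanQuery[] {
>     new SpanTermQuery(new Term("field", "foo")),
>     new SpanTermQuery(new Term("field", "bar")),
>     new SpanTermQuery(new Term("field", "baz"))}, 0, true);
> SpanQuery catenatedOrParts = new SpanOrQuery(
>     new SpanTermQuery(new Term("field", "foobarbaz")), parts);
> SpanQuery phrase = new SpanNearQuery(new SpanQuery[] {
>     new SpanTermQuery(new Term("field", "added")),
>     catenatedOrParts,
>     new SpanTermQuery(new Term("field", "entry"))}, 0, true);
> {code}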
>
> The code of the test in which I isolated the problem is attached below:
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.CharArraySet;
> import org.apache.lucene.analysis.LowerCaseFilter;
> import org.apache.lucene.analysis.MockTokenizer;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
> import org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.RandomIndexWriter;
> import org.apache.lucene.queryparser.classic.QueryParser;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.util.LuceneTestCase;
> import org.junit.AfterClass;
> import org.junit.BeforeClass;
>
> public class TestPhrase extends LuceneTestCase {
>   private static IndexSearcher searcher;
>   private static IndexReader reader;
>   private static Directory directory;
>   private Query query;
>
>   // Search-time analyzer: generates word parts, catenates words/numbers,
>   // and splits on case changes.
>   private static Analyzer searchAnalyzer = new Analyzer() {
>     @Override
>     public TokenStreamComponents createComponents(String fieldName) {
>       Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
>       TokenFilter filter1 = new WordDelimiterGraphFilter(tokenizer,
>           WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
>           WordDelimiterGraphFilter.GENERATE_WORD_PARTS
>               | WordDelimiterGraphFilter.CATENATE_WORDS
>               | WordDelimiterGraphFilter.CATENATE_NUMBERS
>               | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE,
>           CharArraySet.EMPTY_SET);
>       TokenFilter filter2 = new LowerCaseFilter(filter1);
>       return new TokenStreamComponents(tokenizer, filter2);
>     }
>   };
>
>   // Index-time analyzer: same as the search analyzer, plus
>   // GENERATE_NUMBER_PARTS and PRESERVE_ORIGINAL.
>   private static Analyzer indexAnalyzer = new Analyzer() {
>     @Override
>     public TokenStreamComponents createComponents(String fieldName) {
>       Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
>       TokenFilter filter1 = new WordDelimiterGraphFilter(tokenizer,
>           WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
>           WordDelimiterGraphFilter.GENERATE_WORD_PARTS
>               | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS
>               | WordDelimiterGraphFilter.CATENATE_WORDS
>               | WordDelimiterGraphFilter.CATENATE_NUMBERS
>               | WordDelimiterGraphFilter.PRESERVE_ORIGINAL
>               | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE,
>           CharArraySet.EMPTY_SET);
>       TokenFilter filter2 = new LowerCaseFilter(filter1);
>       return new TokenStreamComponents(tokenizer, filter2);
>     }
>
>     @Override
>     public int getPositionIncrementGap(String fieldName) {
>       return 100;
>     }
>   };
>
>   @BeforeClass
>   public static void beforeClass() throws Exception {
>     directory = newDirectory();
>     RandomIndexWriter writer = new RandomIndexWriter(random(), directory, indexAnalyzer);
>     Document doc = new Document();
>     doc.add(newTextField("field", "Added FooBarBaz entry", Field.Store.YES));
>     writer.addDocument(doc);
>     reader = writer.getReader();
>     writer.close();
>     searcher = new IndexSearcher(reader);
>   }
>
>   @AfterClass
>   public static void afterClass() throws Exception {
>     searcher = null;
>     reader.close();
>     reader = null;
>     directory.close();
>     directory = null;
>   }
>
>   public void testSearch() throws Exception {
>     QueryParser parser = new QueryParser("field", searchAnalyzer);
>     query = parser.parse("\"Added FooBarBaz entry\"");
>     System.out.println(query);
>     ScoreDoc[] hits = searcher.search(query, 1000).scoreDocs;
>     assertEquals(1, hits.length);
>   }
> }
> {code}
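> To see what the two analyzers actually emit, the token streams can be
> dumped with the standard attribute API, e.g. (a sketch, not part of the
> original test):
> {code:java}
> // Sketch: print each token with its position increment and position
> // length for the input that is indexed above.
> try (TokenStream ts = searchAnalyzer.tokenStream("field", "Added FooBarBaz entry")) {
>   CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
>   PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
>   PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
>   ts.reset();
>   while (ts.incrementToken()) {
>     System.out.println(term + " posInc=" + posInc.getPositionIncrement()
>         + " posLen=" + posLen.getPositionLength());
>   }
>   ts.end();
> }
> {code}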
>
>
> NOTE: I tested this on Lucene 7.1.0, 7.4.0, and 7.7.0.