If you need to get line numbers back and the regex does not span lines, you could consider indexing each line as a separate document.
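For what it's worth, here is a minimal sketch of the per-line idea in plain Java, without Lucene. It is only an illustration under assumptions: the class LineSubstringFinder, the Hit type and the find method are hypothetical names, tokens are split on whitespace (roughly what WhitespaceTokenizer does), and the substring is located inside each token with String.indexOf, which corresponds to Option 1 in the original question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: treats every line as its own unit
// and reports where the substring occurs inside whitespace-separated tokens.
class LineSubstringFinder {

    // One occurrence of the searched substring.
    static final class Hit {
        final int line;     // 1-based line number
        final int start;    // 0-based start offset within the line (inclusive)
        final int end;      // 0-based end offset within the line (exclusive)
        final String token; // the whitespace-delimited token containing the hit

        Hit(int line, int start, int end, String token) {
            this.line = line;
            this.start = start;
            this.end = end;
            this.token = token;
        }
    }

    static List<Hit> find(String text, String needle) {
        List<Hit> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits; // an empty substring would match everywhere
        }
        String[] lines = text.split("\r?\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            int pos = 0;
            while (pos < line.length()) {
                // Skip whitespace in front of the next token.
                while (pos < line.length() && Character.isWhitespace(line.charAt(pos))) {
                    pos++;
                }
                int tokenStart = pos;
                // Consume the token itself.
                while (pos < line.length() && !Character.isWhitespace(line.charAt(pos))) {
                    pos++;
                }
                if (tokenStart == pos) {
                    break; // only trailing whitespace was left
                }
                String token = line.substring(tokenStart, pos);
                // Additional search inside the token for every occurrence.
                int from = token.indexOf(needle);
                while (from >= 0) {
                    int start = tokenStart + from;
                    hits.add(new Hit(i + 1, start, start + needle.length(), token));
                    from = token.indexOf(needle, from + 1);
                }
            }
        }
        return hits;
    }
}
```

Calling LineSubstringFinder.find("Mama loves\nbanana", "a") yields five hits: offsets 1-2 and 3-4 on line 1 (both in 'Mama'), and offsets 1-2, 3-4 and 5-6 on line 2 (all in 'banana'). In a real index the same per-line split would decide what goes into each document, with the line number stored as a field.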

On Tue, Jun 26, 2018, 9:04 AM Mikhail Khludnev <m...@apache.org> wrote:

> I mean, you'd rather need offsets, not positions, but I don't have anything
> definite to suggest.
>
> On Tue, Jun 26, 2018 at 1:29 PM Gordin, Ira <ira.gor...@sap.com> wrote:
>
> > Hello Mikhail,
> >
> > I see in the link you sent that PositionIncrementAttribute determines the
> > position of this token relative to the previous Token in a TokenStream,
> > used in phrase searching.
> > I am not doing phrase searching.
> > Would you mind explaining how it can help me?
> >
> > Thanks,
> > Ira
> >
> > -----Original Message-----
> > From: Mikhail Khludnev [mailto:m...@apache.org]
> > Sent: Tuesday, June 26, 2018 12:33 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: How search code files for words which contains a given
> > substrings?
> >
> > Hello, Ira.
> > Note the difference between offset
> > https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/OffsetAttribute.html
> > and position
> > https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html
> > in Lucene terminology.
> > Please make sure you don't rebuild existing functionality:
> > https://lucene.apache.org/core/7_3_1/highlighter/org/apache/lucene/search/highlight/package-summary.html#package.description
> >
> > On Tue, Jun 26, 2018 at 10:57 AM Gordin, Ira <ira.gor...@sap.com> wrote:
> >
> > > Hi all,
> > > I started to work on a project which currently searches code files for
> > > words which contain a given substring.
> > > Currently it uses WhitespaceTokenizer and a regex query which wraps the
> > > searched substring with '.*'.
> > > For example, if one searches for 'a', the query will be '/.*a.*/'. In
> > > this way, in the 'Mama loves banana' text, it will find the tokens 'Mama'
> > > and 'banana'.
> > > Now I need to get the start and end positions of the matched tokens in
> > > the line, and the line number.
> > > With TokenStream I can get the start and end positions of 'Mama' and
> > > 'banana' in the full text. But I need the positions of 'a'.
> > > I see 2 options.
> > > Option 1: to perform an additional search in the returned token.
> > > Option 2: to use NGramTokenizer or NGramTokenFilter (not sure which of
> > > them), and in this way I hope I will get the 'a' positions in the
> > > TokenStream.
> > > An additional question: how can I get the line numbers and the positions
> > > inside the line?
> > > Many thanks in advance for your help,
> > > Ira
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
>
> --
> Sincerely yours
> Mikhail Khludnev