Can you please share the custom Analyzer you have? In particular, I am interested in knowing how to get access to the position and offset values for each token.

Regards,
JK

On Tue, Mar 18, 2008 at 10:48 AM, mark harwood <[EMAIL PROTECTED]> wrote:

> I've used a custom analyzer before now to "blend in" GATE annotations as
> tokens at the same position as the words they relate to.
>
> E.g.
>     Fred Smith works for Microsoft
>
> would ordinarily be tokenized as the following tokens:
>
> position  offset  text
> ========  ======  =====
> 1         0       fred
> 2         5       smith
> 3         11      works
> ....
>
> But in a custom analyzer you would know the offsets of all these normal
> tokens plus have visibility of the GATE annotations, including their
> offsets. Your custom analyzer can blend these to produce the following:
>
> position  offset  text
> ========  ======  ===========
> 1         0       fred
> 1         0       GATE_PERSON
> 2         5       smith
> 3         11      works
>
> The trick to adding "GATE_PERSON" at the same position as "fred" is to set
> the "position increment" of this token to zero.
>
> Now you can construct a Lucene query that uses this position info,
> i.e. instead of searching for the specific:
>
>     "Fred works for Microsoft"~5
>
> you can now search for the more general:
>
>     "GATE_PERSON works for microsoft"~5
>
> The GATE tokens, e.g. "GATE_PERSON", would have to be terms you wouldn't
> expect to find in normal text so they wouldn't clash.
> Another way of doing this which avoids this problem might be to look at
> the new payloads API.
> Anyone care to wade in on whether this is feasible and the state of play
> with payloads?
>
> Cheers
> Mark
>
>
> ----- Original Message ----
> From: Grant Ingersoll <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 18 March, 2008 12:24:02 AM
> Subject: Re: Indexing/Querying Annotations and Fields for a document
>
> You would parse the XML (or whatever) into separate strings, and put
> each piece into its own Field in a Lucene Document.
>
> For instance:
>
> Document doc = new Document();
> String body = getBody(input);
> String people = getPeople(input);
> // Store/Index flags are illustrative; choose the ones your application needs
> doc.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED));
> doc.add(new Field("people", people, Field.Store.YES, Field.Index.TOKENIZED));
>
> writer.addDocument(doc);
>
> Essentially, you just need to implement the getPeople and getBody
> methods to extract the appropriate content from your text.
>
>
> On Mar 17, 2008, at 5:05 PM, lucene-seme1 s wrote:
>
> > I already have the document preprocessed and the annotations (i.e.
> > <Person>John</Person>) are already stored in an array with features
> > attached to some annotations (such as the root and lemma of the word).
> > Can you please elaborate some more on how to "index them as normally
> > would"?
> >
> > Regards,
> > JK
> >
> >
> > On Mon, Mar 17, 2008 at 4:33 PM, Grant Ingersoll <[EMAIL PROTECTED]>
> > wrote:
> >
> >> I think there are a couple of ways you can approach this, although I
> >> have never used GATE.
> >>
> >> If these annotations are marked inline in your content, then you can
> >> either preprocess the files to have them separately and index as you
> >> normally would, or you can use the relatively new TeeTokenFilter and
> >> SinkTokenizer to extract them as you go for use in other fields. I
> >> have done this successfully for some apps that I have worked on and I
> >> think it works quite nicely and beats preprocessing IMO. Essentially,
> >> you set up a TeeTokenFilter that recognizes your Person and then set
> >> that token aside in the Sink. Then, when you construct the Person
> >> field, you use the SinkTokenizer.
> >>
> >> HTH,
> >> Grant
> >>
> >> On Mar 17, 2008, at 8:54 AM, lucene-seme1 s wrote:
> >>
> >>> Hello,
> >>>
> >>> I am a newbie here and still experimenting with Lucene. I have
> >>> annotations and features generated by GATE for many documents and
> >>> would like to index the original content of the documents in
> >>> addition to the generated annotations. The annotations are in the
> >>> form of [<Person> John </Person> loves fishing]. I would like to be
> >>> able to search using the Person attribute.
> >>>
> >>> Any hint or suggestion is highly appreciated
> >>>
> >>> regards,
> >>> JK
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucenebootcamp.com
> >> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
> >>
> >> Lucene Helpful Hints:
> >> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
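
For anyone following the thread, here is a rough sketch of the "blend in" idea Mark describes, written in plain Java with no Lucene dependency so it can be run on its own. It is not Mark's analyzer: the classes Tok and Annotation, the GATE_ prefix helper, and the whitespace tokenizer are all hypothetical stand-ins chosen for illustration. The point is only to show how a position increment of zero stacks an annotation token on top of the word it starts at. In Lucene's own Token class the corresponding accessors should be startOffset(), endOffset(), getPositionIncrement() and setPositionIncrement(0).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AnnotationBlender {

    /** Hypothetical stand-in for an analyzer token. */
    static final class Tok {
        final String text;
        final int startOffset;
        final int posIncr; // 1 = next position, 0 = same position as previous token

        Tok(String text, int startOffset, int posIncr) {
            this.text = text;
            this.startOffset = startOffset;
            this.posIncr = posIncr;
        }
    }

    /** Hypothetical stand-in for a GATE annotation: a type anchored at a start offset. */
    static final class Annotation {
        final String type;
        final int startOffset;

        Annotation(String type, int startOffset) {
            this.type = type;
            this.startOffset = startOffset;
        }
    }

    /** Lowercasing whitespace tokenizer that records each token's start offset. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            out.add(new Tok(text.substring(start, i).toLowerCase(), start, 1));
        }
        return out;
    }

    /** After each word, emit any annotation starting at the same offset with increment 0. */
    static List<Tok> blend(List<Tok> words, List<Annotation> annotations) {
        List<Tok> out = new ArrayList<>();
        for (Tok w : words) {
            out.add(w);
            for (Annotation a : annotations) {
                if (a.startOffset == w.startOffset) {
                    out.add(new Tok("GATE_" + a.type.toUpperCase(), w.startOffset, 0));
                }
            }
        }
        return out;
    }

    /** Running sum of increments; numbered from 1 to match the table in the thread. */
    static int[] positions(List<Tok> toks) {
        int[] pos = new int[toks.size()];
        int current = 0;
        for (int i = 0; i < toks.size(); i++) {
            current += toks.get(i).posIncr;
            pos[i] = current;
        }
        return pos;
    }

    public static void main(String[] args) {
        List<Tok> words = tokenize("Fred Smith works for Microsoft");
        List<Annotation> anns = Collections.singletonList(new Annotation("Person", 0));
        List<Tok> blended = blend(words, anns);
        int[] pos = positions(blended);
        System.out.printf("%-8s  %-6s  %s%n", "position", "offset", "text");
        for (int i = 0; i < blended.size(); i++) {
            Tok t = blended.get(i);
            System.out.printf("%-8d  %-6d  %s%n", pos[i], t.startOffset, t.text);
        }
    }
}
```

Running main prints a position/offset/text table in the same layout as the one in Mark's message. Matching annotations to words by start offset is a simplification: an annotation that spans several words, or that begins mid-word, would need its own rule for which token to attach to.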