Storing all that info per-token as payloads will bloat the index.
Wouldn't it be wiser to use a special token to mark page feeds and
paragraph ends (whose numbers could then be stored as payloads), and to
scan the token stream per document to retrieve them back? Some extra
work at retrieval time, but a much smaller index...
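As a rough sketch of that marker-token idea (names here are made up, not an existing Lucene filter): assume the text is preprocessed so that every form feed becomes a literal "_PAGE_" token that survives tokenization, and note that the BytesRef-based payload call below is from Lucene versions newer than the 3.x releases current at the time of this thread:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Attaches the running page number as a payload to the special
// "_PAGE_" marker token only; ordinary tokens carry no payload.
public final class PageMarkerFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private int pageNo = 1;

    public PageMarkerFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        if (termAtt.toString().equals("_PAGE_")) {
            // A page feed starts a new page; the marker's payload is
            // the number of the page that begins here.
            pageNo++;
            payloadAtt.setPayload(new BytesRef(Integer.toString(pageNo)));
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pageNo = 1;
    }
}

At query time the page for a hit could then be recovered by walking that document's positions for the _PAGE_ term (or by re-analyzing the stored text) and taking the last marker whose position precedes the hit's span.start().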
Itamar.
On 3/8/2010 11:54 PM, Erick Erickson wrote:
No, you can't do this with any existing analyzers I know of. Part
of the problem here is that there's no good generic way to KNOW
what a page and line are.
Have you investigated payloads? I'm not sure they're a good fit for
this particular problem, but it might be worth a look.
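For illustration only (not Erick's code, and it is the per-token variant that the reply at the top of this thread argues will bloat the index), here is a minimal sketch of stamping every token with its line number as a payload. It assumes the caller has precomputed the character offset at which each line starts, and it uses the BytesRef payload API of later Lucene releases; the class and parameter names are hypothetical:

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Stores each token's line number as a payload. lineStarts holds the
// character offset of each line's first character (lineStarts[0] == 0),
// computed from the raw text before analysis.
public final class LineNumberPayloadFilter extends TokenFilter {
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final int[] lineStarts;

    public LineNumberPayloadFilter(TokenStream in, int[] lineStarts) {
        super(in);
        this.lineStarts = lineStarts;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Binary-search the token's start offset against the line table;
        // the insertion point gives the 1-based line number.
        int idx = Arrays.binarySearch(lineStarts, offsetAtt.startOffset());
        int lineNo = idx >= 0 ? idx + 1 : -idx - 1;
        payloadAtt.setPayload(new BytesRef(Integer.toString(lineNo)));
        return true;
    }
}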
Best
Erick
On Tue, Aug 3, 2010 at 10:58 AM, arun r <arun....@gmail.com> wrote:
Hi all,
I am new to Lucene, and I am trying to use it to generate data for a
document classifier. I need to produce the word number (wordno), line
number (lineno), and page number (pageno) for each term/phrase. I was
able to use SpanQuery/SpanNearQuery to get the wordno (span.start())
for the term/phrase. To get the pageno and lineno, do I need to write
a custom Analyzer? Can an Analyzer be made to recognize newline and
page-feed characters and keep track of the lineno and pageno for the
tokens? Is this possible with an existing Lucene Analyzer?
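(For reference, the span-based word-position lookup described above looks roughly like this against the Lucene 3.x API of the time; the "body" field and "classifier" term are made-up examples.)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class SpanPositionExample {
    public static void printWordPositions(IndexReader reader) throws IOException {
        SpanTermQuery query = new SpanTermQuery(new Term("body", "classifier"));
        Spans spans = query.getSpans(reader);
        while (spans.next()) {
            // spans.start() is the word number (token position) of the
            // match within document spans.doc().
            System.out.println("doc=" + spans.doc() + " wordno=" + spans.start());
        }
    }
}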
Thanks,
Arun
--
Where there is a will, there is a way!
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------