Split an input document to store different parts of it as independent Lucene documents.

placoteco placoteco Mon, 21 Sep 2009 04:27:43 -0700

Hi everybody:

I am looking for a way to split an input document identified by a URL, for
example HTML or XML, and store its different parts as independent documents
within the index. Suppose all the documents I have to crawl have the same
internal structure:
<html><body><paragraph>...</paragraph>...<paragraph>...</paragraph></body></html>.
I want to split that input and store every paragraph as an independent
document.


I'll try to explain it with an example. Suppose we have the link
http://my.server:myport/docA.html. We fetch it, and because its content
is <html><body><paragraph>first paragraph</paragraph><paragraph>second
paragraph</paragraph></body></html>, I want to split it and store two
documents. The first will contain the first paragraph and the second will
contain the second paragraph. The Lucene index would look something like
this:

    doc 1:
        -url: http://my.server:myport/docA.html
        -content: first paragraph
        -split: yes
        -split order: 1
        ...

    doc 2:
        -url: http://my.server:myport/docA.html
        -content: second paragraph
        -split: yes
        -split order: 2
        ...
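
To make the intended result concrete, here is a minimal Python sketch of the
splitting step alone. It does not touch Lucene or the crawler. The function
name split_into_documents and the field key split_order are hypothetical,
chosen only to mirror the listing above, and the regular expression assumes
the fixed structure described earlier (no nested <paragraph> elements and no
attributes on the tag). A real HTML or XML parser would be more robust.

```python
import re

# Hypothetical helper, not part of Lucene or any crawler: it only illustrates
# how one fetched page could be turned into several per-paragraph records.
# Assumes plain <paragraph>...</paragraph> elements, never nested.
PARAGRAPH_RE = re.compile(r"<paragraph>(.*?)</paragraph>", re.DOTALL)


def split_into_documents(url, html):
    """Return one field dictionary per <paragraph> element found in html."""
    documents = []
    for order, match in enumerate(PARAGRAPH_RE.finditer(html), start=1):
        # Collapse whitespace so a paragraph wrapped across lines is one string.
        content = " ".join(match.group(1).split())
        documents.append({
            "url": url,            # every part keeps the URL of the source page
            "content": content,
            "split": "yes",
            "split_order": order,  # 1-based position within the source page
        })
    return documents


html = ("<html><body><paragraph>first paragraph</paragraph>"
        "<paragraph>second\nparagraph</paragraph></body></html>")
docs = split_into_documents("http://my.server:myport/docA.html", html)
for doc in docs:
    print(doc)
```

Each dictionary corresponds to one of the entries in the listing above. The
open question is where in the fetch and index pipeline such a step can be
plugged in so that each record is written as its own Lucene document.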

Thanks in advance.
