Hi everybody, I would like to ask for a way to split an input document identified by a URL (for example HTML or XML) and store its different parts as independent documents within the index. Imagine that all the documents I have to crawl have the same internal structure: <html><body><paragraph>...</paragraph>...<paragraph>...</paragraph></body></html>. I want to split that input and store every paragraph as an independent document.
I'll try to explain with an example. Suppose we have the link http://my.server:myport/docA.html. We fetch it, and its content is:

<html><body><paragraph>first paragraph</paragraph><paragraph>second paragraph</paragraph></body></html>

I want to split it and store two documents: the first one containing the first paragraph and the second one containing the second paragraph. The Lucene index would then look something like this:

doc 1:
  -url: http://my.server:myport/docA.html
  -content: first paragraph
  -split: yes
  -split order: 1
  ...
doc 2:
  -url: http://my.server:myport/docA.html
  -content: second paragraph
  -split: yes
  -split order: 2
  ...

Thanks in advance.
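P.S. In case it makes the goal clearer, here is a rough Python sketch of the split I have in mind. It is not crawler or Lucene code; it only shows the records I would like to end up with. The function name split_into_documents and the dictionary keys are my own placeholders, and it assumes the page is well-formed enough to be parsed as XML, which real HTML often is not.

```python
import xml.etree.ElementTree as ET


def split_into_documents(url, markup):
    """Return one record per <paragraph> element, in document order.

    Assumes `markup` is well-formed XML; a tolerant HTML parser would be
    needed for real-world pages.
    """
    root = ET.fromstring(markup)
    records = []
    for order, paragraph in enumerate(root.iter("paragraph"), start=1):
        records.append({
            "url": url,  # every part keeps the URL of the original page
            "content": "".join(paragraph.itertext()).strip(),
            "split": "yes",
            "split_order": order,  # 1-based position within the page
        })
    return records


page = (
    "<html><body>"
    "<paragraph>first paragraph</paragraph>"
    "<paragraph>second paragraph</paragraph>"
    "</body></html>"
)
docs = split_into_documents("http://my.server:myport/docA.html", page)
for doc in docs:
    print(doc)
```

Each dictionary corresponds to one of the index documents listed above. What I am missing is where in the fetch/parse/index pipeline a split like this could be done, so that one fetched URL produces several index entries.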