On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:

Hi,

I am trying to index the content from XML files which are basically the
metadata collected from a website which have a huge collection of documents.
This metadata xml has control characters which causes errors while trying to
parse using the DOM parser. I tried to use encoding = UTF-8 but looks like
it doesn't cover all the unicode characters and I get error. Also when I
tried to use UTF-16, I am getting Prolog content not allowed here. So my
guess is there is no enoding which is going to cover almost all unicode
characters. So I tried to split my metadata files into small files and
processing records which doesnt throw parsing error.

Are you using CDATA section in your XML documents?


But by breaking metadata file into smaller files I get, 10,000 xml files per
metadata file. I have 70 metadata files, so altogether it becomes 7,00,000
files. Processing them individually takes really long time using Lucene, my
guess is I/O is time consuing, like opening every small xml file loading in
DOM extracting required data and processing.

I think DOM is not appropriate for a batch job like your case. Why
don't you try SAX or XPP? And also, try to adjust IndexWriter's
parameters like mergeFactor, minMergeDocs.


Qn  1: Any suggestion to get this indexing time reduced? It would be really
great.

Qn 2 : Am I overlooking something in Lucene with respect to indexing?

Right now 12 metadata files take 10 hrs nearly which is really a long time.

Help Appreciated.

Much Thanks.
--
View this message in context: 
http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9526527
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--
Cheolgoo

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to