Hi, I am trying to index the content from XML files which are basically the metadata collected from a website which have a huge collection of documents. This metadata xml has control characters which causes errors while trying to parse using the DOM parser. I tried to use encoding = UTF-8 but looks like it doesn't cover all the unicode characters and I get error. Also when I tried to use UTF-16, I am getting Prolog content not allowed here. So my guess is there is no enoding which is going to cover almost all unicode characters. So I tried to split my metadata files into small files and processing records which doesnt throw parsing error.
But by breaking metadata file into smaller files I get, 10,000 xml files per metadata file. I have 70 metadata files, so altogether it becomes 7,00,000 files. Processing them individually takes really long time using Lucene, my guess is I/O is time consuing, like opening every small xml file loading in DOM extracting required data and processing. Qn 1: Any suggestion to get this indexing time reduced? It would be really great. Qn 2 : Am I overlooking something in Lucene with respect to indexing? Right now 12 metadata files take 10 hrs nearly which is really a long time. Help Appreciated. Much Thanks. -- View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9526527 Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]