Can you post the relevant indexing code? Are you doing things like
optimizing after every file? Both the parsing and the indexing sound
really long. How big are these files?
Also, I assume your machine is at least somewhat current, right?
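If it helps, the shape I'd expect the indexing loop to have is roughly the following: one IndexWriter for the whole run, and optimize() once at the end, not per file. This is only a rough sketch against the 2.x API; extractDocument() is a made-up stand-in for your own XML-to-Document code, and the paths are placeholders:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // One writer for the entire run -- don't open/optimize/close per file.
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        writer.setMaxBufferedDocs(1000); // buffer more docs in RAM before flushing (default is 10)
        writer.setMergeFactor(50);       // fewer, larger merges while indexing

        File[] xmlFiles = new File("/path/to/xml").listFiles();
        for (int i = 0; i < xmlFiles.length; i++) {
            writer.addDocument(extractDocument(xmlFiles[i]));
        }

        writer.optimize(); // once, at the very end, if at all
        writer.close();
    }

    // Hypothetical placeholder: parse one XML file and build a Lucene Document.
    private static Document extractDocument(File f) {
        Document doc = new Document();
        // ... add Fields extracted from the XML here ...
        return doc;
    }
}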
On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
Thanks for your reply. I tried to measure the I/O and parsing time separately from the indexing time. I observed that I/O and parsing of the 70 files takes about 80 minutes in total, whereas when I combine this with indexing, a single metadata file takes nearly 2 to 3 hours. So it looks like the IndexWriter is where the time goes, especially when appending to an existing index. So what is the best approach to handle this?
Thanks in advance.
Erick Erickson wrote:
See below...
On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
Hi,
I am trying to index content from XML files, which are basically metadata collected from a website that has a huge collection of documents.
This metadata XML contains control characters that cause errors when I try to parse it with the DOM parser. I tried encoding = UTF-8, but it doesn't seem to cover all the Unicode characters and I get an error. When I tried UTF-16, I got a "prolog content not allowed here" error. So my guess is that there is no encoding that will cover almost all Unicode characters. So I tried splitting my metadata files into small files and processing only the records that don't throw a parsing error.
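(One possible workaround, sketched here rather than anything I have running: strip the characters that XML 1.0 does not allow before the DOM parser sees the text, instead of looking for an encoding that tolerates them. The helper name is made up.)

// Drop characters that are not legal in XML 1.0 (the usual culprits are
// stray control characters below 0x20). Note that this simple per-char
// check also drops surrogate pairs, i.e. supplementary characters above 0xFFFF.
public static String stripInvalidXmlChars(String in) {
    StringBuffer out = new StringBuffer(in.length());
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        boolean legal = c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
        if (legal) {
            out.append(c);
        }
    }
    return out.toString();
}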
But by breaking each metadata file into smaller files I end up with about 10,000 XML files per metadata file. I have 70 metadata files, so altogether that is 700,000 files. Processing them individually with Lucene takes a really long time; my guess is that the I/O is the time-consuming part: opening every small XML file, loading it into a DOM, extracting the required data, and processing it.
So why don't you measure and find out before trying to make the indexing step more efficient? You simply cannot optimize without knowing where you're spending your time. I can't tell you how often I've been wrong about "why my program was slow" <G>.
In this case, it should be really simple. Just comment out the part where you index the data and run, say, one of your metadata files. I suspect that Cheolgoo Kang's response is cogent, and you indeed are spending your time parsing the XML. I further suspect that the problem is not disk I/O, but the time spent parsing. But until you measure, you have no clue whether you should mess around with the Lucene parameters, or find another parser, or just live with it. Assuming that you comment out Lucene and things are still slow, the next step would be to just read in each file and NOT parse it, to figure out whether it's the I/O or the parsing.
Then you can worry about how to fix it.
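Something as crude as the following is enough to separate the raw I/O cost from the parse cost (a rough sketch only; the class name is made up, and you'd add the IndexWriter calls back in as a third pass to see what Lucene itself costs):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public class WhereDoesTheTimeGo {
    public static void main(String[] args) throws Exception {
        File[] files = new File(args[0]).listFiles();

        // Pass 1: raw I/O only -- read every byte, parse nothing.
        long start = System.currentTimeMillis();
        byte[] buf = new byte[8192];
        for (int i = 0; i < files.length; i++) {
            BufferedInputStream in = new BufferedInputStream(new FileInputStream(files[i]));
            while (in.read(buf) != -1) { /* just read */ }
            in.close();
        }
        System.out.println("raw I/O: " + (System.currentTimeMillis() - start) + " ms");

        // Pass 2: I/O plus DOM parse, still no Lucene.
        // (The OS will have cached the files from pass 1, so run the two
        // passes separately if you want honest numbers.)
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        start = System.currentTimeMillis();
        for (int i = 0; i < files.length; i++) {
            try {
                builder.parse(files[i]);
            } catch (Exception e) {
                // skip records that don't parse, same as your current approach
            }
        }
        System.out.println("I/O + parse: " + (System.currentTimeMillis() - start) + " ms");
    }
}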
Best
Erick
Qn 1: Any suggestions for getting this indexing time reduced? It would be really great.
Qn 2: Am I overlooking something in Lucene with respect to indexing?
Right now 12 metadata files take nearly 10 hours, which is a really long time.
Help appreciated.
Much thanks.
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]