Erick Erickson wrote:
> Grant:
>
> I think that "Parsing 70 files totally takes 80 minutes" really means
> parsing 70 metadata files containing 10,000 XML files each.....

One metadata file is split into 10,000 XML files, each of which looks like this:

<root>
  <record>
    <header>
      <identifier>oai:CiteSeerPSU:1</identifier>
      <datestamp>1993-08-11</datestamp>
      <setSpec>CiteSeerPSUset</setSpec>
    </header>
    <metadata>
      <oai_citeseer:oai_citeseer
          xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/ http://copper.ist.psu.edu/oai/oai_citeseer.xsd">
        <dc:title>36 Problems for Semantic Interpretation</dc:title>
        <oai_citeseer:author name="Gabriele Scheler">
          <address>80290 Munchen, Germany</address>
          <affiliation>Institut fur Informatik; Technische Universitat Munchen</affiliation>
        </oai_citeseer:author>
        <dc:subject>Gabriele Scheler 36 Problems for Semantic Interpretation</dc:subject>
        <dc:description>This paper presents a collection of problems for
        natural language analysis derived mainly from theoretical linguistics.
        Most of these problems present major obstacles for computational
        systems of language interpretation. The set of given sentences can
        easily be scaled up by introducing more examples per problem. The
        construction of computational systems could benefit from such a
        collection, either using it directly for training and testing or as
        a set of benchmarks to qualify the performance of a NLP system.
        1 Introduction: The main part of this paper consists of a collection
        of problems for semantic analysis of natural language. The problems
        are arranged in the following way: example sentences; concise
        description of the problem; keyword for the type of problem. The
        sources (first appearance in print) of the sentences have been left
        out, because they are sometimes hard to track and will usually not
        be of much use, as they indicate a starting-point of discussion
        only. The keywords howeve...</dc:description>
        <dc:contributor>The Pennsylvania State University CiteSeer Archives</dc:contributor>
        <dc:publisher>unknown</dc:publisher>
        <dc:date>1993-08-11</dc:date>
        <dc:format>ps</dc:format>
        <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
        <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
        <dc:language>en</dc:language>
        <dc:rights>unrestricted</dc:rights>
      </oai_citeseer:oai_citeseer>
    </metadata>
  </record>
</root>

From the above I extract the dc:title and dc:description tags to index.
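(For reference, pulling those two elements out of one record with the stock
JAXP DOM API might look like the sketch below. It assumes a plain,
non-namespace-aware parse, so the qualified names match literally. The
ReadDump helper used in the indexing code that follows is the poster's own
class, so this is only an assumed equivalent, not the code from the thread.)

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class RecordReader {

        // Parse one record file and return the text of its first
        // <dc:title>. The same pattern works for <dc:description>.
        public static String readTitle(File f) throws Exception {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(f);
            // The parser is not namespace-aware by default, so we match
            // the qualified tag name as written in the file.
            NodeList titles = dom.getElementsByTagName("dc:title");
            return titles.getLength() > 0
                    ? titles.item(0).getTextContent()
                    : "";
        }
    }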
Code to do this:

1. I have 70 directories with names like oai_citeseerXYZ/.
2. Under each of these directories I have 10,000 XML files, each containing
   XML data like the above.
3. The program does the following:

    File dir = new File(dirName);
    String[] children = dir.list();
    if (children == null) {
        // Either dir does not exist or is not a directory.
    } else {
        for (int ii = 0; ii < children.length; ii++) {
            // Get the name of the next record file.
            String file = children[ii];
            //System.out.println("The name of file parsed now ==> " + file);

            // Get the <metadata> element nodes from the XML file.
            NodeList nl = ReadDump.getNodeList(dirName + "/" + file, "metadata");
            if (nl == null) {
                //System.out.println("Error shouldn't be thrown ...");
            }

            // Extract the Title and Description tags.
            ReadDump rd = new ReadDump();
            ArrayList alist_Title = rd.getElements(nl, "dc:title");
            ArrayList alist_Descr = rd.getElements(nl, "dc:description");

            // Create an index under ./FINAL/.
            IndexWriter writer = new IndexWriter("./FINAL/",
                    new StopStemmingAnalyzer(), false);
            Document doc = new Document();

            // Add the extracted values as fields of the document.
            for (int k = 0; k < alist_Title.size(); k++) {
                doc.add(new Field("Title", alist_Title.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            for (int k = 0; k < alist_Descr.size(); k++) {
                doc.add(new Field("Description", alist_Descr.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }

            // Add the document to the index.
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
        }
    }

This is the main file which does the indexing. Hope this gives you an idea.
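A note on the loop above: it opens a new IndexWriter, calls optimize(), and
closes the writer again for every single file -- 10,000 times per metadata
directory. Grant's question further down ("Are you doing things like
optimizing after every file?") points at exactly this. A minimal sketch of
the usual batched pattern, keeping the poster's own ReadDump and
StopStemmingAnalyzer classes and the same Lucene-2.x-era API:

    // Open the writer once for the whole run
    // (true = create a fresh index; use false to append on later runs).
    IndexWriter writer = new IndexWriter("./FINAL/",
            new StopStemmingAnalyzer(), true);
    try {
        for (int ii = 0; ii < children.length; ii++) {
            NodeList nl = ReadDump.getNodeList(dirName + "/" + children[ii],
                    "metadata");
            if (nl == null) {
                continue; // skip files that fail to parse
            }
            ReadDump rd = new ReadDump();
            ArrayList alist_Title = rd.getElements(nl, "dc:title");
            ArrayList alist_Descr = rd.getElements(nl, "dc:description");

            Document doc = new Document();
            for (int k = 0; k < alist_Title.size(); k++) {
                doc.add(new Field("Title", alist_Title.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            for (int k = 0; k < alist_Descr.size(); k++) {
                doc.add(new Field("Description",
                        alist_Descr.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            writer.addDocument(doc);
        }
        // Optimize once, at the very end of the run.
        writer.optimize();
    } finally {
        writer.close();
    }

optimize() merges the whole index down to a single segment, so calling it
after every document rewrites the index from scratch each time and the total
work grows roughly quadratically with the number of documents; once at the
end of a bulk load is the usual advice. (Separately, Field.Index.UN_TOKENIZED
indexes each title and description as a single unanalyzed term; for full-text
search over these fields, TOKENIZED is more likely what's wanted.)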
> Lokeya:
> Can you confirm my supposition? And I'd still post the code
> Grant requested if you can.....
>
> So, you're talking about indexing 10,000 xml files in 2-3 hours,
> 8 minutes or so of which is spent reading/parsing, right? It'll be
> important to know how much data you're indexing and how, so the
> code snippet is doubly important....
>
> Erick
>
> On 3/18/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>> Can you post the relevant indexing code? Are you doing things like
>> optimizing after every file? Both the parsing and the indexing sound
>> really long. How big are these files?
>>
>> Also, I assume your machine is at least somewhat current, right?
>>
>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>
>>> Thanks for your reply. I tried to measure the I/O and parsing time
>>> separately from the indexing time. I observed that I/O and parsing
>>> for the 70 files take 80 minutes in total, whereas when I combine
>>> this with indexing, a single metadata file takes nearly 2 to 3
>>> hours. So it looks like the IndexWriter takes the time, especially
>>> when we are appending to the index.
>>>
>>> So what is the best approach to handle this?
>>>
>>> Thanks in advance.
>>>
>>> Erick Erickson wrote:
>>>>
>>>> See below...
>>>>
>>>> On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to index the content of XML files which are basically
>>>>> metadata collected from a website that has a huge collection of
>>>>> documents. This metadata XML contains control characters, which
>>>>> cause errors when parsing with the DOM parser. I tried encoding =
>>>>> UTF-8, but it doesn't seem to cover all the characters and I get
>>>>> an error. When I tried UTF-16, I got "Prolog content not allowed
>>>>> here". So my guess is that there is no encoding that covers all
>>>>> of the characters. So I split my metadata files into small files
>>>>> and process only the records that don't throw a parsing error.
>>>>>
>>>>> But by breaking each metadata file into smaller files I get 10,000
>>>>> XML files per metadata file. I have 70 metadata files, so
>>>>> altogether that is 700,000 files. Processing them individually
>>>>> takes a really long time with Lucene; my guess is that the I/O is
>>>>> time-consuming: opening every small XML file, loading it into a
>>>>> DOM, extracting the required data, and processing it.
>>>>
>>>> So why don't you measure and find out before trying to make the
>>>> indexing step more efficient? You simply cannot optimize without
>>>> knowing where you're spending your time. I can't tell you how often
>>>> I've been wrong about "why my program was slow" <G>.
>>>>
>>>> In this case, it should be really simple. Just comment out the part
>>>> where you index the data and run, say, one of your metadata files.
>>>> I suspect that Cheolgoo Kang's response is cogent, and you indeed
>>>> are spending your time parsing the XML. I further suspect that the
>>>> problem is not disk I/O, but the time spent parsing. But until you
>>>> measure, you have no clue whether you should mess around with the
>>>> Lucene parameters, or find another parser, or just live with it.
>>>> Assuming that you comment out Lucene and things are still slow, the
>>>> next step would be to just read in each file and NOT parse it, to
>>>> figure out whether it's the I/O or the parsing.
>>>>
>>>> Then you can worry about how to fix it.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>>> Qn 1: Any suggestion to get this indexing time reduced? It would
>>>>> be really great.
>>>>>
>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>>>> indexing?
>>>>>
>>>>> Right now 12 metadata files take nearly 10 hours, which is really
>>>>> a long time.
>>>>>
>>>>> Help appreciated.
>>>>>
>>>>> Much thanks.
>>
>> --------------------------
>> Grant Ingersoll
>> Center for Natural Language Processing
>> http://www.cnlp.org
>>
>> Read the Lucene Java FAQ at
>> http://wiki.apache.org/jakarta-lucene/LuceneFAQ
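On the control-character problem that started the thread: no choice of
declared encoding will make characters that are illegal in XML 1.0
parseable, so a common workaround (not suggested in the thread itself; the
helper below is hypothetical) is to filter them out before handing the text
to the DOM parser, instead of splitting the dump and discarding records:

    public class XmlCleaner {

        // Return a copy of the input with characters that are illegal in
        // XML 1.0 removed. Legal: #x9, #xA, #xD, #x20-#xD7FF,
        // #xE000-#xFFFD, plus the supplementary planes (surrogate pairs).
        public static String stripInvalidXmlChars(String in) {
            StringBuffer out = new StringBuffer(in.length());
            for (int i = 0; i < in.length(); i++) {
                char c = in.charAt(i);
                if (c == 0x9 || c == 0xA || c == 0xD
                        || (c >= 0x20 && c <= 0xD7FF)
                        || (c >= 0xD800 && c <= 0xDFFF) // surrogate halves
                        || (c >= 0xE000 && c <= 0xFFFD)) {
                    out.append(c);
                }
            }
            return out.toString();
        }
    }

Note that this filters rather than re-encodes: the file still has to be read
with whatever charset it was actually written in, and genuinely mis-encoded
bytes are a separate problem from well-encoded but illegal control
characters.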