Grant:
I think that "Parsing 70 files totally takes 80 minutes" really means parsing 70 metadata files, each containing 10,000 XML files...
One metadata file is split into 10,000 XML files, each of which looks like this:
<root>
  <record>
    <header>
      <identifier>oai:CiteSeerPSU:1</identifier>
      <datestamp>1993-08-11</datestamp>
      <setSpec>CiteSeerPSUset</setSpec>
    </header>
    <metadata>
      <oai_citeseer:oai_citeseer
          xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/ http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
        <dc:title>36 Problems for Semantic Interpretation</dc:title>
        <oai_citeseer:author name="Gabriele Scheler">
          <address>80290 Munchen , Germany</address>
          <affiliation>Institut fur Informatik; Technische Universitat Munchen</affiliation>
        </oai_citeseer:author>
        <dc:subject>Gabriele Scheler 36 Problems for Semantic Interpretation</dc:subject>
        <dc:description>This paper presents a collection of problems for natural language analysis derived mainly from theoretical linguistics. Most of these problems present major obstacles for computational systems of language interpretation. The set of given sentences can easily be scaled up by introducing more examples per problem. The construction of computational systems could benefit from such a collection, either using it directly for training and testing or as a set of benchmarks to qualify the performance of a NLP system. 1 Introduction The main part of this paper consists of a collection of problems for semantic analysis of natural language. The problems are arranged in the following way: example sentences, concise description of the problem, keyword for the type of problem. The sources (first appearance in print) of the sentences have been left out, because they are sometimes hard to track and will usually not be of much use, as they indicate a starting-point of discussion only. The keywords howeve...
        </dc:description>
        <dc:contributor>The Pennsylvania State University CiteSeer Archives</dc:contributor>
        <dc:publisher>unknown</dc:publisher>
        <dc:date>1993-08-11</dc:date>
        <dc:format>ps</dc:format>
        <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
        <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
        <dc:language>en</dc:language>
        <dc:rights>unrestricted</dc:rights>
      </oai_citeseer:oai_citeseer>
    </metadata>
  </record>
</root>
From the above I will extract the Title and Description tags to index.
Code to do this:
1. I have 70 directories with names like oai_citeseerXYZ/.
2. Under each of the above directories, I have 10,000 XML files, each containing XML data like the above.
3. The program does the following:
// Imports needed: java.io.File, java.util.ArrayList, org.w3c.dom.NodeList,
// org.apache.lucene.document.Document, org.apache.lucene.document.Field,
// org.apache.lucene.index.IndexWriter
File dir = new File(dirName);
String[] children = dir.list();
if (children == null) {
    // Either dir does not exist or is not a directory.
} else {
    for (int ii = 0; ii < children.length; ii++) {
        // Get the name of the file being parsed now.
        String file = children[ii];

        // Get the <metadata> element nodes from the xml file.
        NodeList nl = ReadDump.getNodeList(dirName + "/" + file, "metadata");
        if (nl == null) {
            // Error shouldn't be thrown ...
        }

        ReadDump rd = new ReadDump();
        // Get the extracted tags Title and Description.
        ArrayList alist_Title = rd.getElements(nl, "dc:title");
        ArrayList alist_Descr = rd.getElements(nl, "dc:description");

        // Open the index under ./FINAL/ (false = append to the existing index).
        IndexWriter writer = new IndexWriter("./FINAL/", new StopStemmingAnalyzer(), false);
        Document doc = new Document();

        // Add the ArrayList elements as fields to the doc.
        for (int k = 0; k < alist_Title.size(); k++) {
            doc.add(new Field("Title", alist_Title.get(k).toString(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        for (int k = 0; k < alist_Descr.size(); k++) {
            doc.add(new Field("Description", alist_Descr.get(k).toString(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
        }

        // Add the document built from those fields to the IndexWriter,
        // which will create the index; note this opens, optimizes and
        // closes the writer once per file.
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}
This is the main file that does the indexing.
Hope this gives you an idea.
Lokeya:
Can you confirm my supposition? And I'd still post the code Grant requested if you can...
So, you're talking about indexing 10,000 XML files in 2-3 hours, of which 8 minutes or so is spent reading/parsing, right? It'll be important to know how much data you're indexing and how, so the code snippet is doubly important...
Erick
On 3/18/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
Can you post the relevant indexing code? Are you doing things like optimizing after every file? Both the parsing and the indexing sound really long. How big are these files?
Also, I assume your machine is at least somewhat current, right?
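If you are opening, optimizing and closing the IndexWriter for every file, that alone can account for hours. The usual pattern is to open the writer once, add all the documents, and optimize a single time at the end. A rough, untested sketch of what I mean (parseToDocument and dirName are placeholders for whatever your parsing code does):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Open ONE writer for the whole run (true = create a fresh index),
// instead of opening/optimizing/closing it once per file.
IndexWriter writer = new IndexWriter("./FINAL/", new StandardAnalyzer(), true);
String[] children = new File(dirName).list();
for (int ii = 0; ii < children.length; ii++) {
    // parseToDocument is a placeholder for your XML-to-Document code.
    Document doc = parseToDocument(dirName + "/" + children[ii]);
    writer.addDocument(doc);  // no optimize()/close() inside the loop
}
writer.optimize(); // once, at the very end
writer.close();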
On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
Thanks for your reply. I tried to measure the I/O and parsing time separately from the indexing time. I observed that I/O and parsing for all 70 files takes 80 minutes in total, whereas when I combine this with indexing, a single metadata file takes nearly 2 to 3 hours. So it looks like the IndexWriter takes the time, especially when we are appending to the index file.
So what is the best approach to handle this?
Thanks in advance.
Erick Erickson wrote:
See below...
On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
Hi,
I am trying to index the content of XML files which are basically metadata collected from a website that has a huge collection of documents. This metadata XML has control characters which cause errors when parsing with the DOM parser. I tried to use encoding = UTF-8, but it looks like it doesn't cover all the Unicode characters and I get an error. Also, when I tried to use UTF-16, I get "Prolog content not allowed here." So my guess is there is no encoding which is going to cover almost all Unicode characters. So I tried to split my metadata files into small files, processing only the records which don't throw parsing errors.
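[Aside: rather than splitting the files to dodge the bad characters, a common workaround is to strip everything that is illegal in XML 1.0 (code points below 0x20 other than tab, LF and CR) before handing the stream to the DOM parser. A minimal, untested sketch, where src and dst stand for the input and cleaned output files and UTF-8 is assumed:

import java.io.*;

Reader in = new BufferedReader(new InputStreamReader(new FileInputStream(src), "UTF-8"));
Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dst), "UTF-8"));
int c;
while ((c = in.read()) != -1) {
    // Keep tab, LF, CR and everything from 0x20 up; drop the rest.
    if (c >= 0x20 || c == '\t' || c == '\n' || c == '\r') {
        out.write(c);
    }
}
in.close();
out.close();]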
But by breaking each metadata file into smaller files, I get 10,000 XML files per metadata file. I have 70 metadata files, so altogether that becomes 700,000 files. Processing them individually takes a really long time using Lucene; my guess is that the I/O is time consuming: opening every small XML file, loading it into a DOM, extracting the required data, and processing it.
So why don't you measure and find out before trying to make the indexing step more efficient? You simply cannot optimize without knowing where you're spending your time. I can't tell you how often I've been wrong about "why my program was slow" <G>.
In this case, it should be really simple. Just comment out the part where you index the data and run, say, one of your metadata files. I suspect that Cheolgoo Kang's response is cogent, and you indeed are spending your time parsing the XML. I further suspect that the problem is not disk IO, but the time spent parsing. But until you measure, you have no clue whether you should mess around with the Lucene parameters, or find another parser, or just live with it. Assuming that you comment out Lucene and things are still slow, the next step would be to just read in each file and NOT parse it, to figure out whether it's the IO or the parsing.
Then you can worry about how to fix it.
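Concretely, the measurement could look something like this (a sketch; ReadDump.getNodeList, dirName and children are taken from your snippet, so adjust as needed):

// Time the read+parse step alone, with all Lucene calls left out.
long start = System.currentTimeMillis();
for (int ii = 0; ii < children.length; ii++) {
    NodeList nl = ReadDump.getNodeList(dirName + "/" + children[ii], "metadata");
}
System.out.println("read+parse only: "
        + (System.currentTimeMillis() - start) + " ms");

If that number is close to your total runtime, the parsing is the problem; if not, look at how you use Lucene.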
Best
Erick
Qn 1: Any suggestions to get this indexing time reduced? That would be really great.
Qn 2: Am I overlooking something in Lucene with respect to indexing?
Right now 12 metadata files take nearly 10 hours, which is really a long time.
Help appreciated.
Much thanks.
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ