Erick Erickson wrote:
> Grant:
>
> I think that "Parsing 70 files totally takes 80 minutes" really means
> parsing 70 metadata files containing 10,000 XML files each.....

One metadata file is split into 10,000 XML files, each of which looks like this:

<root>
  <record>
    <header>
      <identifier>oai:CiteSeerPSU:1</identifier>
      <datestamp>1993-08-11</datestamp>
      <setSpec>CiteSeerPSUset</setSpec>
    </header>
    <metadata>
      <oai_citeseer:oai_citeseer
          xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/ http://copper.ist.psu.edu/oai/oai_citeseer.xsd">
        <dc:title>36 Problems for Semantic Interpretation</dc:title>
        <oai_citeseer:author name="Gabriele Scheler">
          <address>80290 Munchen, Germany</address>
          <affiliation>Institut fur Informatik; Technische Universitat Munchen</affiliation>
        </oai_citeseer:author>
        <dc:subject>Gabriele Scheler 36 Problems for Semantic Interpretation</dc:subject>
        <dc:description>This paper presents a collection of problems for
        natural language analysis derived mainly from theoretical linguistics.
        Most of these problems present major obstacles for computational
        systems of language interpretation. The set of given sentences can
        easily be scaled up by introducing more examples per problem. The
        construction of computational systems could benefit from such a
        collection, either using it directly for training and testing or as
        a set of benchmarks to qualify the performance of a NLP system.
        1 Introduction: The main part of this paper consists of a collection
        of problems for semantic analysis of natural language. The problems
        are arranged in the following way: example sentences; concise
        description of the problem; keyword for the type of problem. The
        sources (first appearance in print) of the sentences have been left
        out, because they are sometimes hard to track and will usually not
        be of much use, as they indicate a starting-point of discussion
        only. The keywords howeve...</dc:description>
        <dc:contributor>The Pennsylvania State University CiteSeer Archives</dc:contributor>
        <dc:publisher>unknown</dc:publisher>
        <dc:date>1993-08-11</dc:date>
        <dc:format>ps</dc:format>
        <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
        <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
        <dc:language>en</dc:language>
        <dc:rights>unrestricted</dc:rights>
      </oai_citeseer:oai_citeseer>
    </metadata>
  </record>
</root>

From the above I extract the dc:title and dc:description tags to index.
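(For reference, pulling those two elements out of one record with the stock
JAXP DOM API might look like the sketch below. It assumes a plain,
non-namespace-aware parse, so the qualified names match literally. The
ReadDump helper used in the indexing code that follows is the poster's own
class, so this is only an assumed equivalent, not the code from the thread.)

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class RecordReader {

        // Parse one record file and return the text of its first
        // <dc:title>. The same pattern works for <dc:description>.
        public static String readTitle(File f) throws Exception {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(f);
            // The parser is not namespace-aware by default, so we match
            // the qualified tag name as written in the file.
            NodeList titles = dom.getElementsByTagName("dc:title");
            return titles.getLength() > 0
                    ? titles.item(0).getTextContent()
                    : "";
        }
    }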
Code to do this:

1. I have 70 directories with names like oai_citeseerXYZ/.
2. Under each of these directories I have 10,000 XML files, each containing
   XML data like the above.
3. The program does the following:

    File dir = new File(dirName);
    String[] children = dir.list();
    if (children == null) {
        // Either dir does not exist or is not a directory.
    } else {
        for (int ii = 0; ii < children.length; ii++) {
            // Get the name of the next record file.
            String file = children[ii];
            //System.out.println("The name of file parsed now ==> " + file);

            // Get the <metadata> element nodes from the XML file.
            NodeList nl = ReadDump.getNodeList(dirName + "/" + file, "metadata");
            if (nl == null) {
                //System.out.println("Error shouldn't be thrown ...");
            }

            // Extract the Title and Description tags.
            ReadDump rd = new ReadDump();
            ArrayList alist_Title = rd.getElements(nl, "dc:title");
            ArrayList alist_Descr = rd.getElements(nl, "dc:description");

            // Create an index under ./FINAL/.
            IndexWriter writer = new IndexWriter("./FINAL/",
                    new StopStemmingAnalyzer(), false);
            Document doc = new Document();

            // Add the extracted values as fields of the document.
            for (int k = 0; k < alist_Title.size(); k++) {
                doc.add(new Field("Title", alist_Title.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            for (int k = 0; k < alist_Descr.size(); k++) {
                doc.add(new Field("Description", alist_Descr.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }

            // Add the document to the index.
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
        }
    }

This is the main file which does the indexing. Hope this gives you an idea.
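A note on the loop above: it opens a new IndexWriter, calls optimize(), and
closes the writer again for every single file -- 10,000 times per metadata
directory. Grant's question further down ("Are you doing things like
optimizing after every file?") points at exactly this. A minimal sketch of
the usual batched pattern, keeping the poster's own ReadDump and
StopStemmingAnalyzer classes and the same Lucene-2.x-era API:

    // Open the writer once for the whole run
    // (true = create a fresh index; use false to append on later runs).
    IndexWriter writer = new IndexWriter("./FINAL/",
            new StopStemmingAnalyzer(), true);
    try {
        for (int ii = 0; ii < children.length; ii++) {
            NodeList nl = ReadDump.getNodeList(dirName + "/" + children[ii],
                    "metadata");
            if (nl == null) {
                continue; // skip files that fail to parse
            }
            ReadDump rd = new ReadDump();
            ArrayList alist_Title = rd.getElements(nl, "dc:title");
            ArrayList alist_Descr = rd.getElements(nl, "dc:description");

            Document doc = new Document();
            for (int k = 0; k < alist_Title.size(); k++) {
                doc.add(new Field("Title", alist_Title.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            for (int k = 0; k < alist_Descr.size(); k++) {
                doc.add(new Field("Description",
                        alist_Descr.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            writer.addDocument(doc);
        }
        // Optimize once, at the very end of the run.
        writer.optimize();
    } finally {
        writer.close();
    }

optimize() merges the whole index down to a single segment, so calling it
after every document rewrites the index from scratch each time and the total
work grows roughly quadratically with the number of documents; once at the
end of a bulk load is the usual advice. (Separately, Field.Index.UN_TOKENIZED
indexes each title and description as a single unanalyzed term; for full-text
search over these fields, TOKENIZED is more likely what's wanted.)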
> Lokeya:
> Can you confirm my supposition? And I'd still post the code
> Grant requested if you can.....
>
> So, you're talking about indexing 10,000 xml files in 2-3 hours,
> 8 minutes or so of which is spent reading/parsing, right? It'll be
> important to know how much data you're indexing and how, so the
> code snippet is doubly important....
>
> Erick
>
> On 3/18/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>> Can you post the relevant indexing code? Are you doing things like
>> optimizing after every file? Both the parsing and the indexing sound
>> really long. How big are these files?
>>
>> Also, I assume your machine is at least somewhat current, right?
>>
>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>
>>> Thanks for your reply. I tried to measure the I/O and parsing time
>>> separately from the indexing time. I observed that I/O and parsing
>>> for the 70 files take 80 minutes in total, whereas when I combine
>>> this with indexing, a single metadata file takes nearly 2 to 3
>>> hours. So it looks like the IndexWriter takes the time, especially
>>> when we are appending to the index.
>>>
>>> So what is the best approach to handle this?
>>>
>>> Thanks in advance.
>>>
>>> Erick Erickson wrote:
>>>>
>>>> See below...
>>>>
>>>> On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to index the content of XML files which are basically
>>>>> metadata collected from a website that has a huge collection of
>>>>> documents. This metadata XML contains control characters, which
>>>>> cause errors when parsing with the DOM parser. I tried encoding =
>>>>> UTF-8, but it doesn't seem to cover all the characters and I get
>>>>> an error. When I tried UTF-16, I got "Prolog content not allowed
>>>>> here". So my guess is that there is no encoding that covers all
>>>>> of the characters. So I split my metadata files into small files
>>>>> and process only the records that don't throw a parsing error.
>>>>>
>>>>> But by breaking each metadata file into smaller files I get 10,000
>>>>> XML files per metadata file. I have 70 metadata files, so
>>>>> altogether that is 700,000 files. Processing them individually
>>>>> takes a really long time with Lucene; my guess is that the I/O is
>>>>> time-consuming: opening every small XML file, loading it into a
>>>>> DOM, extracting the required data, and processing it.
>>>>
>>>> So why don't you measure and find out before trying to make the
>>>> indexing step more efficient? You simply cannot optimize without
>>>> knowing where you're spending your time. I can't tell you how often
>>>> I've been wrong about "why my program was slow" <G>.
>>>>
>>>> In this case, it should be really simple. Just comment out the part
>>>> where you index the data and run, say, one of your metadata files.
>>>> I suspect that Cheolgoo Kang's response is cogent, and you indeed
>>>> are spending your time parsing the XML. I further suspect that the
>>>> problem is not disk I/O, but the time spent parsing. But until you
>>>> measure, you have no clue whether you should mess around with the
>>>> Lucene parameters, or find another parser, or just live with it.
>>>> Assuming that you comment out Lucene and things are still slow, the
>>>> next step would be to just read in each file and NOT parse it, to
>>>> figure out whether it's the I/O or the parsing.
>>>>
>>>> Then you can worry about how to fix it.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>>> Qn 1: Any suggestion to get this indexing time reduced? It would
>>>>> be really great.
>>>>>
>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>>>> indexing?
>>>>>
>>>>> Right now 12 metadata files take nearly 10 hours, which is really
>>>>> a long time.
>>>>>
>>>>> Help appreciated.
>>>>>
>>>>> Much thanks.
>>
>> --------------------------
>> Grant Ingersoll
>> Center for Natural Language Processing
>> http://www.cnlp.org
>>
>> Read the Lucene Java FAQ at
>> http://wiki.apache.org/jakarta-lucene/LuceneFAQ
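On the control-character problem that started the thread: no choice of
declared encoding will make characters that are illegal in XML 1.0
parseable, so a common workaround (not suggested in the thread itself; the
helper below is hypothetical) is to filter them out before handing the text
to the DOM parser, instead of splitting the dump and discarding records:

    public class XmlCleaner {

        // Return a copy of the input with characters that are illegal in
        // XML 1.0 removed. Legal: #x9, #xA, #xD, #x20-#xD7FF,
        // #xE000-#xFFFD, plus the supplementary planes (surrogate pairs).
        public static String stripInvalidXmlChars(String in) {
            StringBuffer out = new StringBuffer(in.length());
            for (int i = 0; i < in.length(); i++) {
                char c = in.charAt(i);
                if (c == 0x9 || c == 0xA || c == 0xD
                        || (c >= 0x20 && c <= 0xD7FF)
                        || (c >= 0xD800 && c <= 0xDFFF) // surrogate halves
                        || (c >= 0xE000 && c <= 0xFFFD)) {
                    out.append(c);
                }
            }
            return out.toString();
        }
    }

Note that this filters rather than re-encodes: the file still has to be read
with whatever charset it was actually written in, and genuinely mis-encoded
bytes are a separate problem from well-encoded but illegal control
characters.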