On Wed, Jun 10, 2009 at 4:06 PM, Evan Daniel<evanbd at gmail.com> wrote:
> On Wed, Jun 10, 2009 at 3:49 AM, Daniel Cheng<j16sdiz+freenet at gmail.com> wrote:
>> On Wed, Jun 10, 2009 at 3:18 PM, Evan Daniel<evanbd at gmail.com> wrote:
>>> On Wed, Jun 10, 2009 at 2:56 AM, Daniel Cheng<j16sdiz+freenet at gmail.com> wrote:
>>>> On Wed, Jun 10, 2009 at 2:06 PM, Evan Daniel<evanbd at gmail.com> wrote:
>>>>> On Wed, Jun 10, 2009 at 1:54 AM, Daniel Cheng<j16sdiz+freenet at gmail.com> wrote:
>>>>>> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<evanbd at gmail.com> wrote:
>>>>>>> On my (incomplete) spider index, the index file for the word "the" (it
>>>>>>> indexes no other words) is 17MB.  This seems rather large.  It might
>>>>>>> make sense to have the spider not even bother creating an index on a
>>>>>>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>>>>>>> Of course, this presents the occasional difficulty:
>>>>>>> http://bash.org/?514353  I think I'm in favor of not indexing common
>>>>>>> words even so.
>>>>>>
>>>>>> Yes, it should ignore common words.
>>>>>> This is called a "stopword" in search engine terminology.
>>>>>>
>>>>>>>
>>>>>>> Also, on a related note, the index splitting policy should be a bit
>>>>>>> more sophisticated: in an attempt to fit within the max index size as
>>>>>>> configured, it split all the way down to index_8fc42.xml.  As a
>>>>>>> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
>>>>>>> contains the two words "vergessene" and "txjmnsm".  I suspect it would
>>>>>>> have reliability issues should anyone actually want to search either
>>>>>>> of those.  It would make more sense to have all of index_8fc4 in one
>>>>>>> file, since it would be only trivially larger.  (I have a patch that I
>>>>>>> thought did that, but it has a bug; I'll test once my indexwriter is
>>>>>>> finished writing, since I don't want to interrupt it by reloading the
>>>>>>> plugin.)
>>>>>>
>>>>>> "trivially larger" ...
>>>>>> ugh... how trivial is trivial?
>>>>>>
>>>>>> The XMLLibrarian can handle index_8fc42.xml on its own, with all the other
>>>>>> 8fc4 entries in index_8fc4.xml.
>>>>>> However, as I have stated on IRC, that makes index generation even slower.
>>>>>
>>>>> 8fc42 is 17382 KiB.  All other 8fc4 are 79 KiB combined.
>>>>>
>>>>> Also, it would make index generation faster.  The spider first does
>>>>> all the work of creating 8fc4, then discards it to recreate the
>>>>> sub-indexes.  The vast majority of this work is in 8fc42, which gets
>>>>> created twice.  Not splitting the index would nearly halve the time to
>>>>
>>>> It doesn't get created twice; it shortcuts early.
>>>> See the estimateSize variable in IndexWriter.
>>>
>>> Unless I'm mistaken, the slow part of the index creation is the
>>> term.getPages() call.  That call is where all the disk I/O hides, no?
>>
>> no :)
>> getPages() returns an IPersistentSet (ScalableSet), which is lazily evaluated.
>>
>> Internally, it is a linked set when small and a B-tree when large.
>> The .size() method is always cached.
>
> In this case, I don't think it helps.  13 bytes is a gross
> underestimate of the size adding a page adds to the file.
> estimateSize isn't checked again until all the pages have been added.

Let's see if this patch makes it faster:
http://github.com/j16sdiz/plugin-XMLSpider/commit/d6104814d41521d519f3452fcf3e3d90f795b0be

> Furthermore, that leaves the timing unexplained.

Lots of time is spent in the higher-level loops:

    private boolean generateXML(PerstRoot perstRoot, String prefix) throws IOException {
    [...]
        IterableIterator<Term> termIterator = perstRoot.getTermIterator(prefix, prefix + "g");
        for (Term term : termIterator) {
    [...]
        }

The getTermIterator() call walks the B-tree and fetches a few pages from there.
estimatedSize is calculated as we iterate through the termIterator.

> It takes as long to generate b70 as all the rest of b7* combined.  This is
> fairly consistent across the whole set of files (obviously some variation is
> present).
>
> 2009-06-10 02:59 index_b6e.xml
> 2009-06-10 03:00 index_b6f.xml
> 2009-06-10 03:16 index_b70.xml
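
For reference, a minimal sketch of why the upper bound in
getTermIterator(prefix, prefix + "g") selects exactly the terms under one
prefix. It assumes the term keys are lowercase hex strings (as the
index_8fc42.xml file names suggest) and that the range is half-open; a plain
java.util.TreeMap stands in for the Perst B-tree. PrefixRangeSketch and
keysWithPrefix are hypothetical names, not XMLSpider code, and every size
except 17382 and 3 (KiB, from the figures quoted above) is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class PrefixRangeSketch {
    // Returns every key that starts with `prefix`, assuming keys contain only
    // the characters 0-9 and a-f. 'g' is the character right after 'f', so
    // [prefix, prefix + "g") covers prefix itself and all of its hex extensions.
    static List<String> keysWithPrefix(TreeMap<String, Integer> index, String prefix) {
        // subMap is inclusive of the lower bound and exclusive of the upper bound.
        SortedMap<String, Integer> range = index.subMap(prefix, prefix + "g");
        return new ArrayList<>(range.keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("8fc3f", 4);      // hypothetical neighbour below the range
        index.put("8fc42", 17382);  // KiB, from the thread
        index.put("8fc4b", 3);      // KiB, from the thread
        index.put("8fc50", 10);     // hypothetical neighbour above the range

        System.out.println(keysWithPrefix(index, "8fc4")); // [8fc42, 8fc4b]
    }
}
```

Walking such a range only touches the keys inside it, so the cost of the loop
body per term matters more than the cost of finding the range.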