Thank you very much for the helpful answers. Most of the pages that
didn't make it into the index were indeed due to protocol errors
(mostly exceeding http.max.delays).
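
In case it helps anyone else, the check amounts to something like the
following (the grep pattern is only a guess at the dump format, so adjust
it to match whatever status lines your dump actually contains):

  bin/nutch segread -dump -nocontent -noparsetext -dir segments > segdump.txt
  grep -i "status" segdump.txt | sort | uniq -c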

One quick side note. When I was looking at the Nutch wiki page for
bin/nutch segread, I noticed an error on the page and wasn't sure how
to go about fixing it, or alerting someone who can. The page currently
reads:

...

-nocontent

      ignore content data

-noparsedata

      ignore parse_data data

-nocontent

      ignore parse_text data

...

The second "-nocontent" should probably be "-noparsetext", right?

Thanks again for the help,
Bryan

On 12/8/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Bryan Woliner wrote:
>
> >I have a couple very basic questions about Luke and indexes in
> >general. Answers to any of these questions are much appreciated:
> >
> >1. In the Luke overview tab, what does "Index version" refer to?
> >
> >
>
> It's the time (as in System.currentTimeMillis()) when the index was last
> modified.
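
Good to know. For anyone curious, that value can also be read outside of
Luke with a couple of lines of Lucene -- just a rough sketch, assuming the
Lucene jar bundled with Nutch is on the classpath and that the argument
points at the index directory (e.g. a segment's index/ subdirectory):

  import java.util.Date;
  import org.apache.lucene.index.IndexReader;

  public class ShowIndexVersion {
    public static void main(String[] args) throws Exception {
      // args[0] is the path to the index directory
      long version = IndexReader.getCurrentVersion(args[0]);
      System.out.println("Index version: " + version);
      // per the explanation above, this is a System.currentTimeMillis() value
      System.out.println("As a date:     " + new Date(version));
    }
  }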
>
> >2. Also in the overview tab, if "Has Deletions?" is equal to yes,
> >where are the possible sources of deletions? Dedup? Manual deletions
> >through luke?
> >
> >
> >
>
> Either. Both.
>
> >3. Is there any way (w/ Luke or otherwise) to get a file listing all
> >of the docs in an index. Basically is there an index equivalent of
> >this command (which outputs all the URLs in a segment):
> >
> >bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir
> >
> >
>
> You can browse through documents on the Document tab. But there is no
> option to dump all documents to a file. Besides, some fields which are
> not stored are no longer accessible, so you cannot retrieve them from
> the index (you may be able to reconstruct them, but it's a lossy operation).
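
Understood. For what it's worth, a few lines of Lucene will dump the stored
"url" field of every live document -- again just a rough sketch (it assumes
the stored field is really named "url"; the actual field names are visible
on Luke's Document tab):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;

  public class DumpIndexUrls {
    public static void main(String[] args) throws Exception {
      // args[0] is the path to the index directory
      IndexReader reader = IndexReader.open(args[0]);
      try {
        for (int i = 0; i < reader.maxDoc(); i++) {
          if (reader.isDeleted(i)) continue;      // skip deleted (e.g. deduped) docs
          Document doc = reader.document(i);
          System.out.println(doc.get("url"));     // only stored fields are readable
        }
      } finally {
        reader.close();
      }
    }
  }

Redirecting the output to a file gives roughly the index-side equivalent of
the FetchListEntry -dumpurls listing, minus anything that wasn't stored.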
>
> >4. Finally, my last question is the one I'm most perplexed by:
> >
> >I called "bin/nutch segread -list -dir" for a particular segments
> >directory and found out that one directory had 93 entries. BUT, when I
> >opened up the index of that segment in Luke, there were only 23
> >documents (and 3 deletions)! Where did the rest of the URLs go??
> >
> >
>
> Do a "segread -dump" and check the protocol status and parse status
> for the pages that didn't make it into the index. Most likely you
> encountered either protocol or parsing errors, so there was nothing to
> index from those entries.
>
> In addition, if you ran the deduplication, some of the entries in your
> index may have been deleted because they were considered duplicates.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>

