Have you set the agent properties in 'conf/nutch-site.xml'? Please
check 'logs/hadoop.log' and search for the following words (without the
single quotes): 'fetch', 'ERROR', 'FATAL'. Do they give you any clue?

Also search for 'fetching' in 'logs/hadoop.log' to see whether Nutch
attempted to fetch any of the URLs you were expecting.
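For example, something like this (the two log lines below are made up
just so the commands run as-is; on your install, point grep at
'logs/hadoop.log' directly):

```shell
# Create a tiny sample log so the commands below are runnable here;
# on a real install you would grep logs/hadoop.log instead.
printf '%s\n' \
  'INFO fetcher.Fetcher - fetching http://localhost:8080/mytest/filestore' \
  'ERROR http.Http - failed with: java.net.ConnectException' \
  > hadoop.log

# Look for fetch activity and for errors:
grep -E 'fetch|ERROR|FATAL' hadoop.log

# Narrow to lines showing which URLs Nutch actually tried to fetch:
grep 'fetching' hadoop.log
```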

Regards,
Susam Pal
http://susam.in/

On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> Hope someone can help. I'd like to index and search only a single
> directory of my website. It doesn't work so far (neither building the
> index nor the subsequent searches). Here's my config :-
>
> Url of files to index : http://localhost:8080/mytest/filestore
>
> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> urls/mytest that contains just this entry :-
>
> http://localhost:8080/mytest/filestore
>
> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
> to be parsed) :-
>
> <property>
>    <name>http.content.limit</name>
>    <value>-1</value>
>    <description>The length limit for downloaded content, in bytes.
>    If this value is nonnegative (>=0), content longer than it will be
> truncated;
>    otherwise, no truncation at all.
>    </description>
> </property>
>
> <property>
>    <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>    <description>Regular expression naming plugin directory names to
>    include.  Any plugin not matching this expression is excluded.
>    In any case you need at least include the nutch-extensionpoints
> plugin. By
>    default Nutch includes crawling just HTML and plain text via HTTP,
>    and basic indexing and search plugins. In order to use HTTPS please
> enable
>    protocol-httpclient, but be aware of possible intermittent problems
> with the
>    underlying commons-httpclient library.
>    </description>
> </property>
>
> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> added this line for my domain :-
>
> +^http://([a-z0-9]*\.)*localhost:8080/
>
> The filestore directory contains lots of pdfs but executing :-
>
> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
> the 0.8 tutorial) does not index the files.
>
> Any help much appreciated !
>
>
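One quick sanity check on that crawl-urlfilter line: you can test the
pattern against your seed URL from the shell. grep's extended regexes
are only an approximation of the Java regexes Nutch uses, but they are
close enough for a pattern this simple:

```shell
# Check that the crawl-urlfilter pattern matches the seed URL.
# grep prints the URL when the pattern matches; no output would mean
# the filter is dropping your seed before it is ever fetched.
echo 'http://localhost:8080/mytest/filestore' | \
  grep -E '^http://([a-z0-9]*\.)*localhost:8080/'
# → prints the URL, so this particular pattern does match the seed
```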
