You can resolve the FATAL error regarding 'http.robots.agents' by setting the following property in 'conf/nutch-site.xml':
<property>
  <name>http.robots.agents</name>
  <value>testing,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should put the
  value of http.agent.name as the first agent name, and keep the default *
  at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

However, I don't think this is so critical that it would prevent pages from
being fetched. After you have done this, try once more. If it fails again,
search for the following words in 'logs/hadoop.log':

1. failed - this will tell us which URLs the fetcher could not fetch, along
   with the exception that caused each failure.
2. ERROR - any other errors that occurred.
3. FATAL - any fatal errors.
4. fetching - there should be one 'fetching' line per URL fetched. These
   lines look like:

   2007-09-28 19:16:06,918 INFO fetcher.Fetcher - fetching http://192.168.101.33/url

If you do not find any 'fetching' lines in the log, it means something is
wrong - most probably 'conf/crawl-urlfilter.txt' is misconfigured. (A quick
way to run these searches is sketched below.)
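A minimal sketch of how these searches could be run from the Nutch
directory (assuming a default layout where the log file is
'logs/hadoop.log'; adjust the path if your installation differs):

  grep 'failed' logs/hadoop.log        # URLs the fetcher could not fetch, with the exception
  grep 'ERROR' logs/hadoop.log         # any other errors
  grep 'FATAL' logs/hadoop.log         # fatal errors
  grep -c 'fetching' logs/hadoop.log   # number of 'fetching' lines, i.e. URLs actually fetched

If the last count is zero, have a look at 'conf/crawl-urlfilter.txt'. As an
illustration only (based on the seed URL quoted below), a filter line
restricted to a single directory on localhost might look like:

  +^http://localhost:8080/mytest/filestore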
Regards,
Susam Pal
http://susam.in/

On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> Sorry, I should have been clearer. Those properties are set, although
> with non-significant values. Here's my nutch-site.xml file in total :-
>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>testing</value>
> <description>testing</description>
> </property>
>
> <property>
> <name>http.agent.description</name>
> <value>testing</value>
> <description>testing</description>
> </property>
>
> <property>
> <name>http.agent.url</name>
> <value>testing</value>
> <description>testing</description>
> </property>
>
> <property>
> <name>http.agent.email</name>
> <value>testing</value>
> <description>testing</description>
> </property>
>
> <property>
> <name>http.content.limit</name>
> <value>-1</value>
> <description>The length limit for downloaded content, in bytes.
> If this value is nonnegative (>=0), content longer than it will be
> truncated; otherwise, no truncation at all.
> </description>
> </property>
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin.
> By default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please
> enable protocol-httpclient, but be aware of possible intermittent
> problems with the underlying commons-httpclient library.
> </description>
> </property>
>
> </configuration>
>
> Susam Pal wrote:
> > If you have not set the agent properties, you must set them.
> >
> > http.agent.name
> > http.agent.description
> > http.agent.url
> > http.agent.email
> >
> > The significance of the properties are explained within the
> > <description> tags. For the time being you can set some dummy values
> > and get started.
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> >> I do indeed see a fatal error stating :-
> >>
> >> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
> >> first in 'http.robots.agents' property!
> >>
> >> Obviously this seems critical - the tutorial
> >> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but not
> >> in much detail - are the values of significance ?
> >>
> >> Thanks !
> >>
> >> Susam Pal wrote:
> >>> Have you set the agent properties in 'conf/nutch-site.xml'? Please
> >>> check 'logs/hadoop.log' and search for the following words without
> >>> the single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
> >>>
> >>> Also search for 'fetching' in 'logs/hadoop.log' to see whether it
> >>> attempted to fetch any URLs you were expecting.
> >>>
> >>> Regards,
> >>> Susam Pal
> >>> http://susam.in/
> >>>
> >>> On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> >>>> Hope someone can help. I'd like to index and search only a single
> >>>> directory of my website. Doesn't work so far (both building the
> >>>> index and consequent searches). Here's my config :-
> >>>>
> >>>> Url of files to index : http://localhost:8080/mytest/filestore
> >>>>
> >>>> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> >>>> urls/mytest that contains just this entry :-
> >>>>
> >>>> http://localhost:8080/mytest/filestore
> >>>>
> >>>> b) Edited conf/nutch-site.xml to have these extra entries (included
> >>>> pdf to be parsed) :-
> >>>>
> >>>> <property>
> >>>> <name>http.content.limit</name>
> >>>> <value>-1</value>
> >>>> <description>The length limit for downloaded content, in bytes.
> >>>> If this value is nonnegative (>=0), content longer than it will be
> >>>> truncated; otherwise, no truncation at all.
> >>>> </description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>> <name>plugin.includes</name>
> >>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>>> <description>Regular expression naming plugin directory names to
> >>>> include. Any plugin not matching this expression is excluded.
> >>>> In any case you need at least include the nutch-extensionpoints
> >>>> plugin. By default Nutch includes crawling just HTML and plain text
> >>>> via HTTP, and basic indexing and search plugins. In order to use
> >>>> HTTPS please enable protocol-httpclient, but be aware of possible
> >>>> intermittent problems with the underlying commons-httpclient library.
> >>>> </description>
> >>>> </property>
> >>>>
> >>>> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> >>>> added this line for my domain :-
> >>>>
> >>>> +^http://([a-z0-9]*\.)*localhost:8080/
> >>>>
> >>>> The filestore directory contains lots of pdfs but executing :-
> >>>>
> >>>> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
> >>>> the 0.8 tutorial) does not index the files.
> >>>>
> >>>> Any help much appreciated !
> >>>>
>
