Hi Joshua,

> -----Original Message-----
> From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
> Sent: Friday, 23 April 2010 6:57 AM
> To: nutch-user@lucene.apache.org
> Subject: Language specifications
> 
> 
> Alternate question... thanks to everyone who has tried to help me
> through
> the hadoop/AIX issues with 1.0, but I'm going to need to shelf that for
> just a second while I work on some stuff with 0.9 again.
> 
> I need to support one site that has 3 translations: English, French,
> and
> Spanish.  The language is specified on each page by tags like the
> following:
> <meta name="language" content="ES"/>
> 
> I would like to have one index but yet restrict my search results based
> upon the "lang=" parameter sent to search.jsp.  Is there a way to query
> language specific results only from the index?


Yes, there are at least a couple of ways. The easiest way is just to use Arch 
and define a separate area for each language. Then you can limit your search to 
a particular area, depending on the language. See Arch here:

http://www.atnf.csiro.au/computing/software/arch/

If you don't like easy solutions and don't mind some coding, you can add an 
extra field called "lang" to your documents by writing a couple of custom 
plugins in Nutch, based on IndexingFilter (to add the field at indexing time) 
and RawFieldQueryFilter (to make the field searchable). For sample code, see 
how this is done in Arch, which adds several custom fields. The "lang" field 
is also useful because Nutch checks it when choosing an analyser for the 
document, so if you want to use a custom analyser, you have to add this field. 
The next release of Arch will probably include it, so it will be possible to 
filter on language automatically.
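
As a rough sketch of the "lang" field idea (this is not code from Arch or 
Nutch): the hypothetical helper below pulls the language code out of a meta tag 
like the one in your pages, and appends a "lang:" clause to the user's query 
based on the "lang=" request parameter. The class and method names are made up 
for illustration, the regex assumes the name attribute comes before content, 
and the "lang:xx" clause only works once the field is indexed and a query 
filter is registered for it. The wiring into the Nutch plugin interfaces is 
left out.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Arch.
class LangField {
    // Matches e.g. <meta name="language" content="ES"/> (name before content).
    private static final Pattern META_LANG = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']language[\"']\\s+content\\s*=\\s*[\"']([A-Za-z-]+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Only plain language codes are accepted from the request parameter.
    private static final Pattern LANG_CODE = Pattern.compile("[A-Za-z-]+");

    // Returns the lower-cased language code from the page, or null if absent.
    static String extractLang(String html) {
        Matcher m = META_LANG.matcher(html);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    // Appends a "lang:xx" clause; leaves the query alone if the parameter
    // is missing or does not look like a language code.
    static String restrictToLang(String userQuery, String langParam) {
        if (langParam == null) {
            return userQuery;
        }
        String lang = langParam.trim();
        if (!LANG_CODE.matcher(lang).matches()) {
            return userQuery;
        }
        return userQuery + " lang:" + lang.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extractLang("<meta name=\"language\" content=\"ES\"/>")); // es
        System.out.println(restrictToLang("hadoop", "FR")); // hadoop lang:fr
    }
}
```

The value returned by extractLang is what an indexing filter would store in 
the "lang" field; restrictToLang is what search.jsp could apply before parsing 
the query.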

> 
> And, a bonus question (sorry to put it in the same thread):
> Is there a way to access database information from the Nutch bean?  I'd
> like to be able to display (for healthcheck reasons) the total number
> of
> documents in the index.

I guess there are several ways. You can follow the calls from the Nutch bean 
in a debugger and see how it works. But the easiest way (though possibly not 
the fastest one) is just to submit a trivial query that will match all your 
documents and read the total number of hits. For example, if you are indexing 
www.mysite.com, try a query like "host:mysite". This should do for health 
checks.
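
A minimal sketch of such a health check follows. The hitCounter argument is a 
hypothetical stand-in for whatever submits the query through the Nutch bean 
and returns the total hit count; the class name, the threshold and the status 
strings are all assumptions made for illustration.

```java
import java.util.function.ToLongFunction;

// Hypothetical health check, not part of Nutch.
class IndexHealthCheck {
    // hitCounter: assumed to run the query against the index and return the
    // total number of matching documents (in practice, a thin wrapper around
    // the Nutch bean's search call).
    static String check(ToLongFunction<String> hitCounter,
                        String matchAllQuery, long minExpected) {
        long total = hitCounter.applyAsLong(matchAllQuery);
        String status = total >= minExpected ? "OK" : "LOW";
        return status + ": " + total + " documents match \"" + matchAllQuery + "\"";
    }

    public static void main(String[] args) {
        // Fake counter so the sketch runs without an index.
        ToLongFunction<String> fake = query -> 1234L;
        System.out.println(check(fake, "host:mysite", 1));
    }
}
```

Keeping the index call behind a small function makes it easy to swap in the 
real search later and to test the page without a running index.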

Regards,

Arkadi

 

> 
> Thanks again!
