Thanks for the reply. I've added the plugins you suggested. However, some of them need to be modified before fields such as date can be searched (see the previous email from Benjamin Higgins). I am currently modifying DateQueryFilter.java in query-basic so that adding query.date.boost to nutch-site.xml enables searching the date field.
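To illustrate the kind of change described above, here is a rough standalone sketch in Java. It is not the Nutch plugin API and not the actual DateQueryFilter code: the class and method names (DateBoostSketch, readBoost, extractDate) and the fallback boost of 1.0 are assumptions made purely for illustration, and only the property name query.date.boost comes from this message. It reads the boost from a plain java.util.Properties object and pulls the value out of a date:VALUE token, treating a query written with a space after the colon as having no date value.

```java
import java.util.Properties;

/**
 * Standalone illustration only: this is NOT Nutch's DateQueryFilter or its
 * plugin API. Only the property name "query.date.boost" comes from the
 * message above; every class and method name here is hypothetical.
 */
class DateBoostSketch {

    /** Fallback boost when the property is missing or malformed (assumed default). */
    static final float DEFAULT_BOOST = 1.0f;

    /** Reads query.date.boost from the settings, falling back to DEFAULT_BOOST. */
    static float readBoost(Properties conf) {
        String raw = conf.getProperty("query.date.boost");
        if (raw == null) {
            return DEFAULT_BOOST;
        }
        try {
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_BOOST;
        }
    }

    /**
     * Returns the value of the first "date:VALUE" token in a whitespace-separated
     * query, or null when there is none. "date: 20060801" (with a space) yields
     * null because "date:" and "20060801" end up as separate tokens.
     */
    static String extractDate(String query) {
        final String prefix = "date:";
        for (String token : query.trim().split("\\s+")) {
            if (token.startsWith(prefix) && token.length() > prefix.length()) {
                return token.substring(prefix.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("query.date.boost", "2.0");
        System.out.println(readBoost(conf));                          // prints 2.0
        System.out.println(extractDate("url:http date:20060801"));    // prints 20060801
        System.out.println(extractDate("date: 20060801 test"));       // prints null
    }
}
```

In a real plugin the boost would presumably come from the Nutch configuration rather than a Properties object, and the clause handling would go through the query filter classes; the sketch only shows the two ideas (a configurable boost and a date: clause with no space) in isolation.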
I'll try and post my results, or commit them.

Matt

Lukas Vlcek wrote:
> Hi,
>
> To allow more formats to be indexed you need to modify nutch-site.xml
> and update/add the plugin.includes property (see nutch-default.xml for
> default settings). The following is what I have in nutch-site.xml:
>
> <property>
> <name>plugin.includes</name>
> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
> </property>
>
> [parse-*] is used to parse various formats, [query-more] allows you to
> use the [type:] filter in nutch queries.
>
> Regards,
> Lukas
>
> On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
>> Hi Lukas and everybody!
>>
>> Do you know which file in nutch 0.7.2 I should edit to add some field
>> in my index (i.e. file type - PDF, Word or html)?
>>
>> On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>> >
>> > Hi,
>> >
>> > I am not sure if I can give you any useful hint but the following is
>> > what once worked for me.
>> > Example of query: url:http date:20060801
>> >
>> > date: and type: options can be used in combination with url:
>> > Filter url:http should select all documents (unless you allowed file,
>> > ftp protocols). Plain date or type filters select nothing if they are
>> > used alone.
>> >
>> > And be sure you don't introduce any space between filter name and its
>> > value ([date: 20060801] is not the same as [date:20060801])
>> >
>> > Lukas
>> >
>> > On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> > > Howie,
>> > > I inspected my index using Luke and 20060801 shows up several times
>> > > in the index. I'm unable to query pretty much any field. Several people
>> > > seem to be having the same problem. Does anyone know what's going on?
>> > >
>> > > This is one of the last things I have to resolve to have Nutch deployed
>> > > successfully at my organization.
>> > > Unfortunately, Friday is my last day.
>> > > Can anyone offer any assistance??
>> > > Thanks,
>> > > Matt
>> > >
>> > > Howie Wang wrote:
>> > > > I think that I have problems querying for numbers and
>> > > > words with digits in them. Now that I think of it, is it
>> > > > possible it has something to do with the stemming in
>> > > > either the query filter or indexing? In either case, I would
>> > > > print out the text that is being indexed and the phrases
>> > > > added to the query. You could also use Luke to inspect
>> > > > your index and see whether 20060801 shows up anywhere.
>> > > >
>> > > > Howie
>> > > >
>> > > >> I tried looking for a page that had the date 20060801 and the text
>> > > >> "test" in the page. I tried the following:
>> > > >>
>> > > >> date: 20060801 test
>> > > >>
>> > > >> and
>> > > >>
>> > > >> date 20060721-20060803 test
>> > > >>
>> > > >> Neither worked, any ideas??
>> > > >>
>> > > >> Matt
>> > > >>
>> > > >> Matthew Holt wrote:
>> > > >>> Thanks Jake,
>> > > >>> However, it seems to me that it makes most sense that a query
>> > > >>> should return all pages that match the query, instead of acting as a
>> > > >>> content filter. However, I know it's something easy to suggest when
>> > > >>> you're not having to implement it, so just a suggestion.
>> > > >>>
>> > > >>> Matt
>> > > >>>
>> > > >>> Vanderdray, Jacob wrote:
>> > > >>>> Try querying with both the date and something you'd expect to find
>> > > >>>> in the content. The field query filter is just a filter. It only
>> > > >>>> restricts your results to things that match the basic query and have
>> > > >>>> the contents you require in the field. So if you query for
>> > > >>>> "date:2006080 text" you'll be searching for documents that contain
>> > > >>>> "text" in one of the default query fields and have the value 2006080
>> > > >>>> in the date field.
>> > > >>>> Leaving out text in that example would
>> > > >>>> essentially be asking for nothing in the default fields and 2006080
>> > > >>>> in the date field which is why it doesn't return any results.
>> > > >>>>
>> > > >>>> Hope that helps,
>> > > >>>> Jake.
>> > > >>>>
>> > > >>>>
>> > > >>>> -----Original Message-----
>> > > >>>> From: Matthew Holt [mailto:[EMAIL PROTECTED]
>> > > >>>> Sent: Wed 8/2/2006 4:58 PM
>> > > >>>> To: [email protected]
>> > > >>>> Subject: Querying Fields
>> > > >>>> I am unable to query fields in my index in the method that has
>> > > >>>> been suggested. I used Luke to examine my index and the following
>> > > >>>> field types exist:
>> > > >>>> anchor, boost, content, contentLength, date, digest, host,
>> > > >>>> lastModified, primaryType, segment, site, subType, title, type, url
>> > > >>>>
>> > > >>>> However, when I do a search using one of the fields, followed by a
>> > > >>>> colon, an incorrect result is returned. I used Luke to find the top
>> > > >>>> term in the date field which is '20060801'. I then searched using
>> > > >>>> the following query:
>> > > >>>> date: 20060801
>> > > >>>>
>> > > >>>> Unfortunately, nothing was returned. The correct plugins are
>> > > >>>> enabled, here is an excerpt from my nutch-site.xml:
>> > > >>>>
>> > > >>>> <property>
>> > > >>>> <name>plugin.includes</name>
>> > > >>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>> > > >>>> <description>Regular expression naming plugin directory names to
>> > > >>>> include. Any plugin not matching this expression is excluded.
>> > > >>>> In any case you need at least include the nutch-extensionpoints
>> > > >>>> plugin.
>> > > >>>> By default Nutch includes crawling just HTML and plain text via HTTP,
>> > > >>>> and basic indexing and search plugins.
>> > > >>>> </description>
>> > > >>>> </property>
>> > > >>>>
>> > > >>>>
>> > > >>>> Any ideas? I'm not the only one having the same problem, I saw an
>> > > >>>> earlier mailing list post but couldn't find any resolution...
>> > > >>>> Thanks,
>> > > >>>>
>> > > >>>> Matt
>> > > >>>>
>> > > >>>
>> > > >
>> > >
>> >
>>
>> --
>> Lourival Junior
>> Universidade Federal do Pará
>> Curso de Bacharelado em Sistemas de Informação
>> http://www.ufpa.br/cbsi
>> Msn: [EMAIL PROTECTED]
>>
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
