Ignore the last email. I ended up doing the same as Benjamin Higgins. Works great; use his email for reference if you are trying to accomplish the same thing.

Matt

Matthew Holt wrote:
> Thanks for the reply. I've added the plugins you suggested. However,
> some of the plugins need to be modified to search for fields such as
> date (see previous email from Benjamin Higgins). I am currently
> modifying the query-basic DateQueryFilter.java so one is allowed to
> add query.date.boost to the nutch-site.xml to enable the date field
> search.
>
> I'll try and post my results, or commit them.
> Matt
>
> Lukas Vlcek wrote:
>> Hi,
>>
>> To allow more formats to be indexed you need to modify nutch-site.xml
>> and update/add the plugin.includes property (see nutch-default.xml for
>> default settings). The following is what I have in nutch-site.xml:
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
>> </property>
>>
>> [parse-*] is used to parse various formats, [query-more] allows you to
>> use the [type:] filter in Nutch queries.
>>
>> Regards,
>> Lukas
>>
>> On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
>>> Hi Lukas and everybody!
>>>
>>> Do you know which file in Nutch 0.7.2 I should edit to add some
>>> field to my index (e.g. file type: PDF, Word or HTML)?
>>>
>>> On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I am not sure if I can give you any useful hint but the following is
>>> > what once worked for me.
>>> > Example of query: url:http date:20060801
>>> >
>>> > The date: and type: options can be used in combination with url:
>>> > The filter url:http should select all documents (unless you allowed the
>>> > file or ftp protocols). A plain date or type filter selects nothing if
>>> > it is used alone.
>>> >
>>> > And be sure you don't introduce any space between the filter name and its
>>> > value ([date: 20060801] is not the same as [date:20060801])
>>> >
>>> > Lukas
>>> >
>>> > On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>> > > Howie,
>>> > > I inspected my index using Luke and 20060801 shows up several times
>>> > > in the index. I'm unable to query pretty much any field. Several people
>>> > > seem to be having the same problem. Does anyone know what's going on?
>>> > >
>>> > > This is one of the last things I have to resolve to have Nutch deployed
>>> > > successfully at my organization. Unfortunately, Friday is my last day.
>>> > > Can anyone offer any assistance??
>>> > > Thanks,
>>> > > Matt
>>> > >
>>> > > Howie Wang wrote:
>>> > > > I think that I have problems querying for numbers and
>>> > > > words with digits in them. Now that I think of it, is it
>>> > > > possible it has something to do with the stemming in
>>> > > > either the query filter or indexing? In either case, I would
>>> > > > print out the text that is being indexed and the phrases
>>> > > > added to the query. You could also use Luke to inspect
>>> > > > your index and see whether 20060801 shows up anywhere.
>>> > > >
>>> > > > Howie
>>> > > >
>>> > > >> I tried looking for a page that had the date 20060801 and the text
>>> > > >> "test" in the page. I tried the following:
>>> > > >>
>>> > > >> date: 20060801 test
>>> > > >>
>>> > > >> and
>>> > > >>
>>> > > >> date 20060721-20060803 test
>>> > > >>
>>> > > >> Neither worked, any ideas??
>>> > > >>
>>> > > >> Matt
>>> > > >>
>>> > > >> Matthew Holt wrote:
>>> > > >>> Thanks Jake,
>>> > > >>> However, it seems to me that it makes most sense that a query
>>> > > >>> should return all pages that match the query, instead of acting as a
>>> > > >>> content filter. However, I know it's something easy to suggest when
>>> > > >>> you're not having to implement it, so just a suggestion.
>>> > > >>>
>>> > > >>> Matt
>>> > > >>>
>>> > > >>> Vanderdray, Jacob wrote:
>>> > > >>>> Try querying with both the date and something you'd expect to find
>>> > > >>>> in the content. The field query filter is just a filter. It only
>>> > > >>>> restricts your results to things that match the basic query and have
>>> > > >>>> the contents you require in the field. So if you query for
>>> > > >>>> "date:2006080 text" you'll be searching for documents that contain
>>> > > >>>> "text" in one of the default query fields and have the value 2006080
>>> > > >>>> in the date field. Leaving out text in that example would
>>> > > >>>> essentially be asking for nothing in the default fields and 2006080
>>> > > >>>> in the date field, which is why it doesn't return any results.
>>> > > >>>>
>>> > > >>>> Hope that helps,
>>> > > >>>> Jake.
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> -----Original Message-----
>>> > > >>>> From: Matthew Holt [mailto:[EMAIL PROTECTED]
>>> > > >>>> Sent: Wed 8/2/2006 4:58 PM
>>> > > >>>> To: [email protected]
>>> > > >>>> Subject: Querying Fields
>>> > > >>>> I am unable to query fields in my index in the method that has
>>> > > >>>> been suggested. I used Luke to examine my index and the following
>>> > > >>>> field types exist:
>>> > > >>>> anchor, boost, content, contentLength, date, digest, host,
>>> > > >>>> lastModified, primaryType, segment, site, subType, title, type, url
>>> > > >>>>
>>> > > >>>> However, when I do a search using one of the fields, followed by a
>>> > > >>>> colon, an incorrect result is returned. I used Luke to find the top
>>> > > >>>> term in the date field, which is '20060801'. I then searched using
>>> > > >>>> the following query:
>>> > > >>>> date: 20060801
>>> > > >>>>
>>> > > >>>> Unfortunately, nothing was returned.
>>> > > >>>> The correct plugins are enabled; here is an excerpt from my
>>> > > >>>> nutch-site.xml:
>>> > > >>>>
>>> > > >>>> <property>
>>> > > >>>> <name>plugin.includes</name>
>>> > > >>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>> > > >>>> <description>Regular expression naming plugin directory names to
>>> > > >>>> include. Any plugin not matching this expression is excluded.
>>> > > >>>> In any case you need at least include the nutch-extensionpoints
>>> > > >>>> plugin. By default Nutch includes crawling just HTML and plain
>>> > > >>>> text via HTTP, and basic indexing and search plugins.
>>> > > >>>> </description>
>>> > > >>>> </property>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> Any ideas? I'm not the only one having the same problem; I saw an
>>> > > >>>> earlier mailing list post but couldn't find any resolution... Thanks,
>>> > > >>>>
>>> > > >>>> Matt
>>> > > >>>>
>>>
>>> --
>>> Lourival Junior
>>> Universidade Federal do Pará
>>> Curso de Bacharelado em Sistemas de Informação
>>> http://www.ufpa.br/cbsi
>>> Msn: [EMAIL PROTECTED]
>>>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
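
A note on the whitespace pitfall raised in the thread ([date: 20060801] is not the same as [date:20060801]): the small Java sketch below shows one plausible way the difference could arise. It is a toy model only, not Nutch's actual query parser. The class name FieldQuerySketch, the Clause type and the "DEFAULT" label are hypothetical names made up for illustration, and the assumption that a query is split on whitespace before field prefixes are recognised is mine, not something confirmed in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the field:value query syntax discussed in this thread.
// NOT Nutch's real parser; all names here are hypothetical.
class FieldQuerySketch {

    /** One whitespace-separated clause: a field name plus a term. */
    static final class Clause {
        final String field; // "DEFAULT" when the token carries no usable field prefix
        final String term;

        Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }

        @Override
        public String toString() {
            return field + "=" + term;
        }
    }

    /** Splits the query on whitespace, then splits each token at its first colon. */
    static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields a single empty token
            }
            int colon = token.indexOf(':');
            // A field clause needs text on both sides of the colon.
            if (colon > 0 && colon < token.length() - 1) {
                clauses.add(new Clause(token.substring(0, colon), token.substring(colon + 1)));
            } else {
                clauses.add(new Clause("DEFAULT", token));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // prints [date=20060801, DEFAULT=test]
        System.out.println(parse("date:20060801 test"));
        // prints [DEFAULT=date:, DEFAULT=20060801, DEFAULT=test]
        System.out.println(parse("date: 20060801 test"));
    }
}
```

Under this assumption, the query with a space never produces a date clause at all, so the date filter has no value to match. Whether Nutch behaves this way internally should be checked against its own query parser before relying on it.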
