Hi Lukas and everybody!
Do you know which file in Nutch 0.7.2 I should edit to add a field to my
index (e.g. file type: PDF, Word or HTML)?
On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am not sure if I can give you any useful hint, but the following is
> what once worked for me.
> Example of query: url:http date:20060801
>
> The date: and type: options can be used in combination with url:
> The filter url:http should select all documents (unless you allowed the
> file or ftp protocols). Plain date or type filters select nothing if they
> are used alone.
>
> And be sure you don't introduce any space between filter name and its
> value ([date: 20060801] is not the same as [date:20060801])
>
> Lukas
>
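To make the two rules above concrete (always pair date:/type: with url:http, and never put a space after the colon), here is a small sketch of a helper that assembles such a query string. It is only an illustration: field_clause and build_query are made-up names, not part of Nutch, and the helper does nothing more than string formatting.

```python
def field_clause(field, value):
    """Return 'field:value' with no whitespace around the colon."""
    field = str(field).strip()
    value = str(value).strip()
    if not field or not value:
        raise ValueError("field and value must be non-empty")
    if any(ch.isspace() for ch in field + value):
        # A space would split the clause into separate query terms.
        raise ValueError("whitespace is not allowed inside a clause")
    return f"{field}:{value}"


def build_query(filters, anchor=("url", "http")):
    """Prefix the filters with an anchor clause such as url:http."""
    clauses = [field_clause(*anchor)]
    clauses += [field_clause(f, v) for f, v in filters]
    return " ".join(clauses)


print(build_query([("date", "20060801")]))  # url:http date:20060801
```

The anchor defaults to url:http because that is the combination reported to work in this thread; whether other anchors behave the same way is not something I have checked.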
> On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> > Howie,
> > I inspected my index using Luke and 20060801 shows up several times
> > in the index. I'm unable to query pretty much any field. Several people
> > seem to be having the same problem. Does anyone know what's going on?
> >
> > This is one of the last things I have to resolve to have Nutch deployed
> > successfully at my organization. Unfortunately, Friday is my last day.
> > Can anyone offer any assistance?
> > Thanks,
> > Matt
> >
> > Howie Wang wrote:
> > > I think that I have problems querying for numbers and
> > > words with digits in them. Now that I think of it, is it
> > > possible it has something to do with the stemming in
> > > either the query filter or indexing? In either case, I would
> > > print out the text that is being indexed and the phrases
> > > added to the query. You could also use Luke to inspect
> > > your index and see whether 20060801 shows up anywhere.
> > >
> > > Howie
> > >
> > >> I tried looking for a page that had the date 20060801 and the text
> > >> "test" in the page. I tried the following:
> > >>
> > >> date: 20060801 test
> > >>
> > >> and
> > >>
> > >> date 20060721-20060803 test
> > >>
> > >> Neither worked. Any ideas?
> > >>
> > >> Matt
> > >>
> > >> Matthew Holt wrote:
> > >>> Thanks Jake,
> > >>> However, it seems to me that it makes most sense that a query
> > >>> should return all pages that match the query, instead of acting as a
> > >>> content filter. However, I know it's something easy to suggest when
> > >>> you're not having to implement it, so just a suggestion.
> > >>>
> > >>> Matt
> > >>>
> > >>> Vanderdray, Jacob wrote:
> > >>>> Try querying with both the date and something you'd expect to find
> > >>>> in the content. The field query filter is just a filter. It only
> > >>>> restricts your results to things that match the basic query and have
> > >>>> the contents you require in the field. So if you query for
> > >>>> "date:2006080 text" you'll be searching for documents that contain
> > >>>> "text" in one of the default query fields and have the value 2006080
> > >>>> in the date field. Leaving out text in that example would
> > >>>> essentially be asking for nothing in the default fields and 2006080
> > >>>> in the date field, which is why it doesn't return any results.
> > >>>>
> > >>>> Hope that helps,
> > >>>> Jake.
> > >>>>
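The behaviour Jake describes can be pictured with a toy model. This is not Nutch code, just a sketch of the semantics as explained above: the plain terms pick the candidate documents, and the field clauses only narrow that set. The search function and the sample documents are invented for illustration.

```python
def search(docs, terms, field_filters):
    """Toy model: terms select candidates, field filters only restrict them."""
    # Step 1: the basic query. With no terms there is nothing to match,
    # so no document becomes a candidate.
    candidates = [
        d for d in docs
        if terms and all(t in d["content"].split() for t in terms)
    ]
    # Step 2: the field filters narrow the candidate set.
    return [
        d for d in candidates
        if all(d.get(field) == value for field, value in field_filters.items())
    ]


docs = [
    {"url": "http://example.com/a", "content": "test page", "date": "20060801"},
    {"url": "http://example.com/b", "content": "test other", "date": "20060715"},
]

# Term plus filter: only the document with the matching date remains.
print([d["url"] for d in search(docs, ["test"], {"date": "20060801"})])

# Filter alone: no basic query, so nothing is returned.
print(search(docs, [], {"date": "20060801"}))
```

In this model a filter on its own always yields an empty result, which matches the explanation given for why a bare date clause returns nothing.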
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Matthew Holt [mailto:[EMAIL PROTECTED]
> > >>>> Sent: Wed 8/2/2006 4:58 PM
> > >>>> To: [email protected]
> > >>>> Subject: Querying Fields
> > >>>> I am unable to query fields in my index in the method that has
> > >>>> been suggested. I used Luke to examine my index and the following
> > >>>> field types exist:
> > >>>> anchor, boost, content, contentLength, date, digest, host,
> > >>>> lastModified, primaryType, segment, site, subType, title, type, url
> > >>>>
> > >>>> However, when I do a search using one of the fields, followed by a
> > >>>> colon, an incorrect result is returned. I used Luke to find the top
> > >>>> term in the date field which is '20060801'. I then searched using
> > >>>> the following query:
> > >>>> date: 20060801
> > >>>>
> > >>>> Unfortunately, nothing was returned. The correct plugins are
> > >>>> enabled, here is an excerpt from my nutch-site.xml:
> > >>>>
> > >>>> <property>
> > >>>> <name>plugin.includes</name>
> > >>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
> > >>>> <description>Regular expression naming plugin directory names to
> > >>>> include. Any plugin not matching this expression is excluded.
> > >>>> In any case you need at least include the nutch-extensionpoints
> > >>>> plugin. By default Nutch includes crawling just HTML and plain text
> > >>>> via HTTP, and basic indexing and search plugins.
> > >>>> </description>
> > >>>> </property>
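Since the description says the value is a regular expression over plugin directory names, one quick sanity check is to test a few names against it. The sketch below does that in Python. It assumes the expression has to match the whole directory name, which I have not verified against the Nutch source, and the names in the loop are just examples picked for illustration.

```python
import re

# The plugin.includes value quoted above, split only for readability.
PLUGIN_INCLUDES = (
    r"protocol-httpclient|urlfilter-regex|"
    r"parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|"
    r"index-(basic|more)|query-(more|site|stemmer|url)|"
    r"summary-basic|scoring-opic"
)


def included(plugin_dir, pattern=PLUGIN_INCLUDES):
    """True if the whole directory name matches the expression (assumption)."""
    return re.fullmatch(pattern, plugin_dir) is not None


for name in ["index-more", "query-more", "parse-pdf", "query-basic"]:
    print(name, included(name))
```

With this value the first three names match and query-basic does not. Whether that has anything to do with the problem discussed here is not established in this thread; the check only shows which names the expression covers.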
> > >>>>
> > >>>>
> > >>>> Any ideas? I'm not the only one having the same problem. I saw an
> > >>>> earlier mailing list post but couldn't find any resolution... Thanks,
> > >>>>
> > >>>> Matt
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >
> > >
> > >
> >
>
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]