Re: Querying Fields

Lourival Júnior Mon, 14 Aug 2006 07:00:20 -0700

OK Lukas, I know what you mean. The community is very important to the
success of a project, especially an open source one. I'm not sure I can
contribute to Nutch right now, because I'm a newbie in this area, but I will
contribute soon. For now, I answer the questions I know something about. I
really appreciate it when you answer our questions, because we feel motivated,
and we'll tell other people that Nutch is very useful when you want to
build a web search engine: not just useful, but the best way.


Regards!

On 8/14/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:


Lourival,

You are definitely not alone in feeling this way. Nutch is quite an active
open source project, so some lack of documentation is natural,
especially since Nutch hasn't reached its 1.0 release yet. Believe me, I
have the same problem all the time.

The best way to change this situation is to contribute! The wiki is
open to anybody and the source code can be downloaded. If you are a freak,
you can suggest changes, and if you are a real hacker (meaning you
are not ashamed to use vi for anything, including writing source code),
you can even become a committer. Once you become a committer,
you will be so overloaded with work that you won't be able
to answer STFW questions on the mailing lists... etc. :-)

Regards,
Lukas

On 8/11/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> Yes, I tested the index-more and query-more plugins. They allow searching
> these fields easily. However, if I had found documentation about them, I
> would not have spent time working out a solution.
>
> Thanks a lot!
>
> On 8/11/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > You need to look into the source to find out exactly what it does. As far
> > as I know, it does not add any new field to the index (that should be done
> > by the index-more plugin), but it allows you to query using type:, date:
> > and site:, I think.
> >
> > Lukas
> >
> > On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> > > What exactly does the query-more plugin do? I tested it a few minutes
> > > ago and it doesn't add any field to the resulting index. Is it used in
> > > the webapp? Could you clarify this for me?
> > >
> > > Thanks!
> > >
> > > On 8/9/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > If my memory serves me correctly, query-more should work fine with
> > > > Nutch 0.7.2 too.
> > > > And you are right, Matthew: you need to use the [type:] or [date:]
> > > > filters in combination with [url:], as you can get an empty result
> > > > set if they are used alone. I do queries like this: [url:http type:pdf]
> > > > and it gives me the result I need.
> > > >
> > > > Lukas
> > > >
> > > > On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> > > > > All right! I've done this already. I think you didn't understand my
> > > > > question.
> > > > > What I want to do is query my indexes using something like
> > > > > "filetype:pdf". Version 0.8 already has this feature, but I'm using
> > > > > version 0.7.2 and I want to add this feature manually. I don't know
> > > > > which files I have to edit. Do you know?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Lourival Junior
> > > > >
> > > > > On 8/9/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > To allow more formats to be indexed, you need to modify nutch-site.xml
> > > > > > and update or add the plugin.includes property (see nutch-default.xml
> > > > > > for the default settings). The following is what I have in nutch-site.xml:
> > > > > >
> > > > > > <property>
> > > > > >   <name>plugin.includes</name>
> > > > > >   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
> > > > > > </property>
> > > > > >
> > > > > > [parse-*] is used to parse various formats; [query-more] allows you to
> > > > > > use the [type:] filter in Nutch queries.
> > > > > >
> > > > > > Regards,
> > > > > > Lukas
> > > > > >
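A minimal sketch of how a plugin.includes value like the one above selects plugins, written in Python purely for illustration. It assumes the expression must match the whole plugin directory name (hence re.fullmatch); that matching mode is an assumption about Nutch, not something confirmed in this thread. The helper name is_included is hypothetical.

```python
import re

# The plugin.includes value quoted in Lukas's nutch-site.xml above.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|"
    "index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if the whole plugin name matches the expression.

    Assumption: the name must match in full, not merely contain a match.
    """
    return re.fullmatch(pattern, plugin_id) is not None


# parse-oo appears in Matthew's configuration but not in this one.
for name in ["parse-pdf", "index-more", "query-more", "parse-oo"]:
    print(name, is_included(name))
```

A check like this can help spot a plugin that was expected to be active but is missing from the expression.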
> > > > > > On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> > > > > > > Hi Lukas and everybody!
> > > > > > >
> > > > > > > Do you know which file in Nutch 0.7.2 I should edit to add a
> > > > > > > field to my index (e.g. file type: PDF, Word or HTML)?
> > > > > > >
> > > > > > > On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I am not sure if I can give you any useful hint, but the following
> > > > > > > > is what once worked for me.
> > > > > > > > Example of query: url:http date:20060801
> > > > > > > >
> > > > > > > > The date: and type: options can be used in combination with url:.
> > > > > > > > The filter url:http should select all documents (unless you allowed
> > > > > > > > the file or ftp protocols). A plain date or type filter selects
> > > > > > > > nothing if it is used alone.
> > > > > > > >
> > > > > > > > And be sure you don't introduce any space between the filter name
> > > > > > > > and its value ([date: 20060801] is not the same as [date:20060801]).
> > > > > > > > Lukas
> > > > > > > >
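A minimal sketch of the query syntax described above, written in Python purely for illustration. It only builds the query string and does not call Nutch. The helper name build_query is hypothetical, and it assumes, as reported in this thread, that date: and type: filters are combined with url:http and that nothing may separate a filter name from its value.

```python
def build_query(url_prefix="http", **filters):
    """Build a query string such as 'url:http date:20060801'.

    Each filter is written as name:value with no space after the colon.
    """
    parts = [f"url:{url_prefix}"]
    for field, value in filters.items():
        value = str(value).strip()  # drop accidental surrounding spaces
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"bad value for {field!r}: {value!r}")
        parts.append(f"{field}:{value}")
    return " ".join(parts)


print(build_query(date="20060801"))
print(build_query(type="pdf"))
```

Stripping the value guards against the [date: 20060801] mistake mentioned above.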
> > > > > > > > On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> > > > > > > > > Howie,
> > > > > > > > >    I inspected my index using Luke and 20060801 shows up several
> > > > > > > > > times in the index. I'm unable to query pretty much any field.
> > > > > > > > > Several people seem to be having the same problem. Does anyone know
> > > > > > > > > what's going on?
> > > > > > > > >
> > > > > > > > > This is one of the last things I have to resolve to have Nutch
> > > > > > > > > deployed successfully at my organization. Unfortunately, Friday is
> > > > > > > > > my last day. Can anyone offer any assistance?
> > > > > > > > > Thanks,
> > > > > > > > >   Matt
> > > > > > > > >
> > > > > > > > > Howie Wang wrote:
> > > > > > > > > > I think that I have problems querying for numbers and
> > > > > > > > > > words with digits in them. Now that I think of it, is it
> > > > > > > > > > possible it has something to do with the stemming in
> > > > > > > > > > either the query filter or indexing? In either case, I would
> > > > > > > > > > print out the text that is being indexed and the phrases
> > > > > > > > > > added to the query. You could also use Luke to inspect
> > > > > > > > > > your index and see whether 20060801 shows up anywhere.
> > > > > > > > > >
> > > > > > > > > > Howie
> > > > > > > > > >
> > > > > > > > > >> I tried looking for a page that had the date 20060801 and the
> > > > > > > > > >> text "test" in the page. I tried the following:
> > > > > > > > > >>
> > > > > > > > > >> date: 20060801 test
> > > > > > > > > >>
> > > > > > > > > >> and
> > > > > > > > > >>
> > > > > > > > > >> date 20060721-20060803 test
> > > > > > > > > >>
> > > > > > > > > >> Neither worked, any ideas??
> > > > > > > > > >>
> > > > > > > > > >> Matt
> > > > > > > > > >>
> > > > > > > > > >> Matthew Holt wrote:
> > > > > > > > > >>> Thanks Jake,
> > > > > > > > > >>>   However, it seems to me that it makes most sense that a
> > > > > > > > > >>> query should return all pages that match the query, instead of
> > > > > > > > > >>> acting as a content filter. However, I know it's something easy
> > > > > > > > > >>> to suggest when you're not having to implement it, so just a
> > > > > > > > > >>> suggestion.
> > > > > > > > > >>>
> > > > > > > > > >>> Matt
> > > > > > > > > >>>
> > > > > > > > > >>> Vanderdray, Jacob wrote:
> > > > > > > > > >>>> Try querying with both the date and something you'd expect to
> > > > > > > > > >>>> find in the content.  The field query filter is just a filter.
> > > > > > > > > >>>> It only restricts your results to things that match the basic
> > > > > > > > > >>>> query and have the contents you require in the field.  So if
> > > > > > > > > >>>> you query for "date:2006080 text" you'll be searching for
> > > > > > > > > >>>> documents that contain "text" in one of the default query
> > > > > > > > > >>>> fields and have the value 2006080 in the date field.  Leaving
> > > > > > > > > >>>> out text in that example would essentially be asking for
> > > > > > > > > >>>> nothing in the default fields and 2006080 in the date field,
> > > > > > > > > >>>> which is why it doesn't return any results.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Hope that helps,
> > > > > > > > > >>>> Jake.
> > > > > > > > > >>>>
> > > > > > > > > >>>>
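A toy model of the filter behaviour Jake describes, written in Python purely for illustration. It is not Nutch code: the documents, the example.com URLs and the search function are all made up, and the model only covers the rule that a field clause narrows the results of the basic query and returns nothing when no basic terms are given. It does not attempt to model the url: clause discussed elsewhere in this thread.

```python
# Made-up in-memory "index"; the field names follow the ones listed from Luke.
DOCS = [
    {"url": "http://example.com/a", "content": "a test page", "date": "20060801"},
    {"url": "http://example.com/b", "content": "another test", "date": "20060715"},
    {"url": "http://example.com/c", "content": "unrelated", "date": "20060801"},
]


def search(query, docs=DOCS):
    """Return URLs matching every plain term AND every field:value clause."""
    terms, filters = [], {}
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and field and value:
            filters[field] = value        # field clause, e.g. date:20060801
        else:
            terms.append(token.lower())   # plain term for the default field
    if not terms:
        return []  # nothing was asked of the default field, so nothing matches
    hits = []
    for doc in docs:
        words = doc["content"].lower().split()
        if all(t in words for t in terms) and all(
            doc.get(f) == v for f, v in filters.items()
        ):
            hits.append(doc["url"])
    return hits


print(search("date:20060801 test"))  # term plus filter
print(search("date:20060801"))       # filter alone
```

In this toy model the filter alone yields an empty list, which mirrors the empty result sets reported above.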
> > > > > > > > > >>>> -----Original Message-----
> > > > > > > > > >>>> From: Matthew Holt [mailto:[EMAIL PROTECTED]
> > > > > > > > > >>>> Sent: Wed 8/2/2006 4:58 PM
> > > > > > > > > >>>> To: [email protected]
> > > > > > > > > >>>> Subject: Querying Fields
> > > > > > > > > >>>>  I am unable to query fields in my index in the way that has
> > > > > > > > > >>>> been suggested. I used Luke to examine my index and the
> > > > > > > > > >>>> following fields exist:
> > > > > > > > > >>>> anchor, boost, content, contentLength, date, digest, host,
> > > > > > > > > >>>> lastModified, primaryType, segment, site, subType, title,
> > > > > > > > > >>>> type, url
> > > > > > > > > >>>>
> > > > > > > > > >>>> However, when I do a search using one of the fields, followed
> > > > > > > > > >>>> by a colon, an incorrect result is returned. I used Luke to
> > > > > > > > > >>>> find the top term in the date field, which is '20060801'. I
> > > > > > > > > >>>> then searched using the following query:
> > > > > > > > > >>>> date: 20060801
> > > > > > > > > >>>>
> > > > > > > > > >>>> Unfortunately, nothing was returned. The correct plugins are
> > > > > > > > > >>>> enabled; here is an excerpt from my nutch-site.xml:
> > > > > > > > > >>>>
> > > > > > > > > >>>> <property>
> > > > > > > > > >>>>   <name>plugin.includes</name>
> > > > > > > > > >>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
> > > > > > > > > >>>>   <description>Regular expression naming plugin directory names to
> > > > > > > > > >>>>   include.  Any plugin not matching this expression is excluded.
> > > > > > > > > >>>>   In any case you need at least include the nutch-extensionpoints
> > > > > > > > > >>>>   plugin. By default Nutch includes crawling just HTML and plain
> > > > > > > > > >>>>   text via HTTP, and basic indexing and search plugins.
> > > > > > > > > >>>>   </description>
> > > > > > > > > >>>> </property>
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>> Any ideas? I'm not the only one having this problem; I saw an
> > > > > > > > > >>>> earlier mailing list post but couldn't find any resolution...
> > > > > > > > > >>>> Thanks,
> > > > > > > > > >>>>
> > > > > > > > > >>>>    Matt
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Lourival Junior
> > > > > > > Universidade Federal do Pará
> > > > > > > Curso de Bacharelado em Sistemas de Informação
> > > > > > > http://www.ufpa.br/cbsi
> > > > > > > Msn: [EMAIL PROTECTED]




--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
