OK Lukas, I know what you mean. The community is very important to the success of a project, especially an open source one. I'm not sure I can contribute to Nutch right now, because I'm a newbie in this area, but I will contribute soon. For now, I answer the questions I know something about. I really appreciate it when you answer our questions: it keeps us motivated, and we'll tell other people that Nutch is not just useful for building a web search engine, but the best way to do it.
Regards!

On 8/14/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
Lourival,

You are definitely not alone with this feeling. Nutch is quite an active
open source project, so some lack of documentation is natural, especially
since Nutch hasn't reached its 1.0 release. Believe me, I have the same
problem all the time. The best way to change this situation is to
contribute! The wiki is open to anybody, the source code can be
downloaded, and if you are a freak you can suggest changes, and if you
are a real hacker (meaning you are not ashamed to use vi for anything,
including writing source code) you can even become a committer. Once you
become a committer you will be overloaded with work to the point that you
won't be able to answer STFW questions on the mailing lists... etc. :-)

Regards,
Lukas

On 8/11/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> Yes yes, I tested the index-more and query-more plugins. They make it
> easy to search these fields. However, if I could have found
> documentation about them, I would not have spent time thinking about a
> solution.
>
> Thanks a lot!
>
> On 8/11/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > You need to look into the source to find out exactly what it does. As
> > far as I know it does not add any new field to the index (that should
> > be done via the index-more plugin), but it allows you to query using
> > type:, date: and site:, I think.
> >
> > Lukas
> >
> > On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> > > What exactly does the query-more plugin do? I tested it a few
> > > minutes ago and it doesn't add any field to the resulting index. Is
> > > it used in the webapp? Could you clarify this for me?
> > >
> > > Thanks!
> > >
> > > On 8/9/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > If my memory serves me correctly, query-more should work fine with
> > > > Nutch 0.7.2 too.
> > > > And you are right Matthew, you need to use the [type:] and [date:]
> > > > filters in combination with [url:], since you can get an empty
> > > > result set if they are used alone. I do queries like this:
> > > > [url:http type:pdf] and it gives me the result I need.
> > > >
> > > > Lukas
> > > >
> > > > On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> > > > > All right! I've done this already. I think you didn't understand
> > > > > my question. What I want to do is query my indexes using
> > > > > something like "filetype:pdf". Version 0.8 already has this
> > > > > feature, but I'm using version 0.7.2 and I want to add this
> > > > > feature manually. I don't know where I have to edit. Do you know?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Lourival Junior
> > > > >
> > > > > On 8/9/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > To allow more formats to be indexed you need to modify
> > > > > > nutch-site.xml and update/add the plugin.includes property
> > > > > > (see nutch-default.xml for the default settings). The
> > > > > > following is what I have in nutch-site.xml:
> > > > > >
> > > > > > <property>
> > > > > >   <name>plugin.includes</name>
> > > > > >   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
> > > > > > </property>
> > > > > >
> > > > > > [parse-*] is used to parse various formats, [query-more]
> > > > > > allows you to use the [type:] filter in Nutch queries.
> > > > > >
> > > > > > Regards,
> > > > > > Lukas
> > > > > >
> > > > > > On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> > > > > > > Hi Lukas and everybody!
> > > > > > >
> > > > > > > Do you know which file in Nutch 0.7.2 I should edit to add
> > > > > > > a field to my index (i.e.
> > > > > > > file type - PDF, Word or HTML)?
> > > > > > >
> > > > > > > On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I am not sure if I can give you any useful hint, but the
> > > > > > > > following is what once worked for me.
> > > > > > > > Example of query: url:http date:20060801
> > > > > > > >
> > > > > > > > The date: and type: options can be used in combination
> > > > > > > > with url:
> > > > > > > > The filter url:http should select all documents (unless
> > > > > > > > you allowed the file or ftp protocols). Plain date or
> > > > > > > > type filters select nothing if they are used alone.
> > > > > > > >
> > > > > > > > And be sure you don't introduce any space between the
> > > > > > > > filter name and its value ([date: 20060801] is not the
> > > > > > > > same as [date:20060801])
> > > > > > > >
> > > > > > > > Lukas
> > > > > > > >
> > > > > > > > On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> > > > > > > > > Howie,
> > > > > > > > >    I inspected my index using Luke and 20060801 shows
> > > > > > > > > up several times in the index. I'm unable to query
> > > > > > > > > pretty much any field. Several people seem to be having
> > > > > > > > > the same problem. Does anyone know what's going on?
> > > > > > > > >
> > > > > > > > > This is one of the last things I have to resolve to
> > > > > > > > > have Nutch deployed successfully at my organization.
> > > > > > > > > Unfortunately, Friday is my last day. Can anyone offer
> > > > > > > > > any assistance??
> > > > > > > > > Thanks,
> > > > > > > > > Matt
> > > > > > > > >
> > > > > > > > > Howie Wang wrote:
> > > > > > > > > > I think that I have problems querying for numbers and
> > > > > > > > > > words with digits in them. Now that I think of it, is
> > > > > > > > > > it possible it has something to do with the stemming
> > > > > > > > > > in either the query filter or indexing?
> > > > > > > > > > In either case, I would print out the text that is
> > > > > > > > > > being indexed and the phrases added to the query. You
> > > > > > > > > > could also use Luke to inspect your index and see
> > > > > > > > > > whether 20060801 shows up anywhere.
> > > > > > > > > >
> > > > > > > > > > Howie
> > > > > > > > > >
> > > > > > > > > >> I tried looking for a page that had the date
> > > > > > > > > >> 20060801 and the text "test" in the page. I tried
> > > > > > > > > >> the following:
> > > > > > > > > >>
> > > > > > > > > >> date: 20060801 test
> > > > > > > > > >>
> > > > > > > > > >> and
> > > > > > > > > >>
> > > > > > > > > >> date 20060721-20060803 test
> > > > > > > > > >>
> > > > > > > > > >> Neither worked, any ideas??
> > > > > > > > > >>
> > > > > > > > > >> Matt
> > > > > > > > > >>
> > > > > > > > > >> Matthew Holt wrote:
> > > > > > > > > >>> Thanks Jake,
> > > > > > > > > >>>    However, it seems to me that it makes most sense
> > > > > > > > > >>> for a query to return all pages that match the
> > > > > > > > > >>> query, instead of acting as a content filter.
> > > > > > > > > >>> However, I know it's something easy to suggest when
> > > > > > > > > >>> you're not having to implement it, so just a
> > > > > > > > > >>> suggestion.
> > > > > > > > > >>>
> > > > > > > > > >>> Matt
> > > > > > > > > >>>
> > > > > > > > > >>> Vanderdray, Jacob wrote:
> > > > > > > > > >>>> Try querying with both the date and something
> > > > > > > > > >>>> you'd expect to find in the content. The field
> > > > > > > > > >>>> query filter is just a filter. It only restricts
> > > > > > > > > >>>> your results to things that match the basic query
> > > > > > > > > >>>> and have the contents you require in the field.
> > > > > > > > > >>>> So if you query for "date:2006080 text" you'll be
> > > > > > > > > >>>> searching for documents that contain "text" in one
> > > > > > > > > >>>> of the default query fields and have the value
> > > > > > > > > >>>> 2006080 in the date field. Leaving out text in
> > > > > > > > > >>>> that example would essentially be asking for
> > > > > > > > > >>>> nothing in the default fields and 2006080 in the
> > > > > > > > > >>>> date field, which is why it doesn't return any
> > > > > > > > > >>>> results.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Hope that helps,
> > > > > > > > > >>>> Jake.
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>> -----Original Message-----
> > > > > > > > > >>>> From: Matthew Holt [mailto:[EMAIL PROTECTED]
> > > > > > > > > >>>> Sent: Wed 8/2/2006 4:58 PM
> > > > > > > > > >>>> To: nutch-user@lucene.apache.org
> > > > > > > > > >>>> Subject: Querying Fields
> > > > > > > > > >>>> I am unable to query fields in my index using the
> > > > > > > > > >>>> method that has been suggested. I used Luke to
> > > > > > > > > >>>> examine my index and the following field types
> > > > > > > > > >>>> exist:
> > > > > > > > > >>>> anchor, boost, content, contentLength, date,
> > > > > > > > > >>>> digest, host, lastModified, primaryType, segment,
> > > > > > > > > >>>> site, subType, title, type, url
> > > > > > > > > >>>>
> > > > > > > > > >>>> However, when I do a search using one of the
> > > > > > > > > >>>> fields, followed by a colon, an incorrect result
> > > > > > > > > >>>> is returned. I used Luke to find the top term in
> > > > > > > > > >>>> the date field, which is '20060801'. I then
> > > > > > > > > >>>> searched using the following query:
> > > > > > > > > >>>> date: 20060801
> > > > > > > > > >>>>
> > > > > > > > > >>>> Unfortunately, nothing was returned.
> > > > > > > > > >>>> The correct plugins are enabled; here is an
> > > > > > > > > >>>> excerpt from my nutch-site.xml:
> > > > > > > > > >>>>
> > > > > > > > > >>>> <property>
> > > > > > > > > >>>>   <name>plugin.includes</name>
> > > > > > > > > >>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
> > > > > > > > > >>>>   <description>Regular expression naming plugin
> > > > > > > > > >>>>   directory names to include. Any plugin not
> > > > > > > > > >>>>   matching this expression is excluded.
> > > > > > > > > >>>>   In any case you need at least include the
> > > > > > > > > >>>>   nutch-extensionpoints plugin. By default Nutch
> > > > > > > > > >>>>   includes crawling just HTML and plain text via
> > > > > > > > > >>>>   HTTP, and basic indexing and search plugins.
> > > > > > > > > >>>>   </description>
> > > > > > > > > >>>> </property>
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>> Any ideas? I'm not the only one having this
> > > > > > > > > >>>> problem; I saw an earlier mailing list post but
> > > > > > > > > >>>> couldn't find any resolution...
> > > > > > > > > >>>> Thanks,
> > > > > > > > > >>>>
> > > > > > > > > >>>> Matt
> > > > > > >
> > > > > > > --
> > > > > > > Lourival Junior
> > > > > > > Universidade Federal do Pará
> > > > > > > Curso de Bacharelado em Sistemas de Informação
> > > > > > > http://www.ufpa.br/cbsi
> > > > > > > Msn: [EMAIL PROTECTED]
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
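
For anyone reading this thread later: the two query rules that came out of it (no space between a filter name and its value, and pairing a bare type:/date: filter with something else such as url:http) can be sketched as a small Python helper. This is only an illustration and not part of Nutch. The names field_clause and build_query are hypothetical, and the url:http fallback is simply Lukas's workaround from above, which assumes every indexed URL starts with http.

```python
def field_clause(field, value):
    """Return 'field:value' with no whitespace around the colon.

    Hypothetical helper: per the thread, [date: 20060801] is not the
    same query as [date:20060801].
    """
    field = str(field).strip()
    value = str(value).strip()
    if not field or not value:
        raise ValueError("field and value must both be non-empty")
    if any(ch.isspace() for ch in field + value):
        raise ValueError("whitespace inside a clause would split it into separate terms")
    return f"{field}:{value}"


def build_query(text="", **filters):
    """Build a query string from free text plus field filters.

    Assumption taken from the thread: a type:/date: filter used alone
    returned nothing, so when there is no free text and no url filter,
    prepend url:http as a catch-all.
    """
    clauses = [field_clause(name, value) for name, value in filters.items()]
    words = text.split()
    if clauses and not words and "url" not in filters:
        clauses.insert(0, field_clause("url", "http"))
    return " ".join(clauses + words)


if __name__ == "__main__":
    print(build_query(type="pdf"))               # url:http type:pdf
    print(build_query("test", date="20060801"))  # date:20060801 test
```

If your index also holds file: or ftp: URLs, the url:http fallback will miss them, so a content term is the safer companion for the filter in that case.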