[Nutch-general] Re: using index filters to add a field

Raghavendra Prabhu Sun, 08 Jan 2006 23:37:02 -0800

Your insight has been very helpful.

I will try to implement what you said and see if it works out.


So what you are saying is to have two separate webdbs and then merge the
segments into one centralized folder.

I will try this out.

As for the URL that the indexing filter uses to categorize a page: should we
have a matching urlfilter there?

How do we make use of Nutch's existing urlfilter so that we can add our own
categorisation to the index plugin?
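
To make the question concrete, the URL check inside the indexing filter could
be as simple as the sketch below. This is only an illustration: the class name
UrlCategorizer and the example.com/example.org hosts are hypothetical, and it
does not use Nutch's urlfilter plugins at all; it just maps a URL's host to a
category string.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: maps a URL to a category by host.
final class UrlCategorizer {
    // Assumed host lists; replace them with the sites you actually crawl.
    private static final Set<String> MOVIE_HOSTS =
        new HashSet<>(Arrays.asList("movies.example.com", "films.example.org"));
    private static final Set<String> MUSIC_HOSTS =
        new HashSet<>(Arrays.asList("music.example.com"));

    /** Returns "movies", "music", or "miscellaneous" for the given URL. */
    static String categorize(String url) {
        String host = hostOf(url);
        if (MOVIE_HOSTS.contains(host)) {
            return "movies";
        }
        if (MUSIC_HOSTS.contains(host)) {
            return "music";
        }
        // Every document gets some value, so the field is never missing.
        return "miscellaneous";
    }

    /** Extracts the lower-cased host, or "" if the URL cannot be parsed. */
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }
}
```

The indexing filter would then add the returned string as the value of the
"type" field.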

Rgds
Raghavendra Prabhu



On 1/9/06, Howie Wang <[EMAIL PROTECTED]> wrote:
>
> Someone correct me if I'm wrong, but I believe you can have
> two different crawls using two different webdbs, then you can
> copy the indexed segments into one central directory where
> the searcher can pick it up as a single index.
>
> If your crawling habits are different for the movies and music
> stuff, it might be easier to do it this way. If you have overlap
> between the two sites and want to use the same webdb, you
> might have to do what you're saying, but I think you'll have to
> write your own utility to set the fetch times the way you want.
>
> Howie
>
> >How about refetching only the movie pages?
> >
> >So during a refetch, if we want to fetch only movie pages,
> >
> >what do we do?
> >
> >When we generate segments from the db,
> >
> >can we specify that only URLs matching a certain query should be picked?
> >
> >Won't this create difficulties in refetching those pages?
> >
> >Rgds
> >
> >Prabhu
> >
> >
> >On 1/9/06, Howie Wang <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > The "type" query filter will only handle parts of the query that are
> > > specified with "type:". The other parts of the query string will be
> > > passed to the other query filters. By default the basic query filter
> > > will search for things in content, url, host, title and anchor text.
> > >
> > > So, for example, if you have a query string that says "type:movies
> > > harry potter", the "type:movies" will use the "type" query filter you
> > > wrote, and the "harry potter" will probably go to the basic query
> > > filter, which will find matches within content, url, host, anchor
> > > text, and title. The results are just the intersection of the two,
> > > which is what you want.
> > >
> > > You'll have to edit your search form handler to add "type:movies" or
> > > "type:music" to the query string before it passes it to Nutch, but
> > > that's pretty easy to do.
> > >
> > > Howie
> > >
> > > >Hi Wang
> > > >
> > > > >But I thought that when you include a query plugin and you have a
> > > > >field called
> > > > >
> > > > >type:
> > > > >
> > > > >it will search content only in that field.
> > > > >
> > > > >So are you asking me to make all the content a subset of this one?
> > > > >
> > > > >For example, query-url will basically search in the url field of the
> > > > >documents.
> > > > >
> > > > >So how can this be a solution?
> > > >
> > > >
> > > >
> > > >Rgds
> > > >Prabhu
> > > >
> > > >
> > > >On 1/9/06, Howie Wang <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > To do what I mentioned, you basically have to write two plugins,
> > > > > > an IndexFilter plugin and a QueryFilter plugin. I think this page
> > > > > > has some info on writing plugins:
> > > > > >
> > > > > > http://wiki.apache.org/nutch/WritingPlugins
> > > > > >
> > > > > > It will probably be easiest if you copy the src/plugins/index-basic
> > > > > > directory, and just change all the build files and filenames as
> > > > > > needed. If you look at BasicIndexingFilter.java file, you'll see
> > > > > > that the modifications needed aren't bad at all. There are a whole
> > > > > > bunch of lines that do something like:
> > > > >
> > > > >    doc.add(Field.Text("myfield"), "somevalue");
> > > > >
> > > > > You should figure out if the url is from a movie page and then
> > > > > add your field:
> > > > >
> > > > >    if (isFromMovieSite(url)) {
> > > > >        doc.add(Field.Text("type"), "movies");
> > > > >    } else if (isFromMusicSite(url)) {
> > > > >        doc.add(Field.Text("type"), "music");
> > > > >    }  else {
> > > > >        // Need to make sure all docs have the field,
> > > > >        // Otherwise it will crash when you search
> > > > >        doc.add(Field.Text("type"), "miscellaneous");
> > > > >    }
> > > > >
> > > > > > Doing the query filter is even easier, just copy the
> > > > > > src/plugins/query-site directory, change filenames and build files
> > > > > > as needed. And change the line that says:
> > > > > >
> > > > > >    super("site");
> > > > > >
> > > > > > to:
> > > > > >
> > > > > >    super("type");
> > > > > >
> > > > > > That's pretty much it. You'll have to edit your conf/nutch-*.xml
> > > > > > files to include your new plugins.
> > > > >
> > > > >
> > > > > > >Can you explain what exactly you have in mind?
> > > > > > >
> > > > > > >Say that I have fetched sites under the movie category (a list of
> > > > > > >websites which I have), and have also fetched sites for songs.
> > > > > > >How do I specifically add a field to the first set of pages (i.e.
> > > > > > >the movies) and a separate field to the second (i.e. the songs)?
> > > > > > >
> > > > > > >And for field search, how can I search by this field?
> > > > > > >
> > > > > > >How will Nutch understand this query:
> > > > > > >newfield:uniquename
> > > > > > >
> > > > > > >I thought you needed to create a query plugin for each field you
> > > > > > >create (like query-url).
> > > > > > >
> > > > > > >I still did not get what you meant. If you can explain it clearly,
> > > > > > >it will be helpful.
> > > > > >
> > > > > >Thanks .
> > > > > >Raghavendra Prabhu R
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
>
>
>
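
For reference, the search-form change described in the quoted thread (adding
"type:movies" or "type:music" to the query string before it is passed to
Nutch) might look like the sketch below. The class and method names are
hypothetical, and it only builds the query string; it does not call any Nutch
API.

```java
// Hypothetical helper for the search form handler, not part of Nutch.
final class TypedQuery {
    /**
     * Prepends a "type:<category>" clause to the user's query, unless the
     * query already contains a term that starts with "type:".
     */
    static String withType(String userQuery, String category) {
        String q = userQuery == null ? "" : userQuery.trim();
        for (String term : q.split("\\s+")) {
            if (term.startsWith("type:")) {
                return q; // the user already chose a type; leave it alone
            }
        }
        String clause = "type:" + category;
        return q.isEmpty() ? clause : clause + " " + q;
    }
}
```

With this, a form that searches only movie sites would pass
TypedQuery.withType(userInput, "movies") to Nutch instead of the raw input.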
