Re: [Nutch-general] I can not query myplugin in field category:test

Ernesto De Santis Mon, 16 Oct 2006 06:00:32 -0700

Hi Alvaro

Very good. It seems the plugin could be extended to support both
strategies, the current one and regexps, perhaps by using another tag
to define URL expressions:


                .....
               <whitelistexp>
                       http://www.categorySite.es/setcionA(.)*sectionB(.)*
               </whitelistexp>
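
A minimal sketch of how both strategies could coexist, assuming plain
<whitelist> entries keep the current substring test and <whitelistexp>
entries are treated as regular expressions. The class name DualWhitelist
and the "/videos/" entry are made up for illustration; this is not the
actual Subcollection code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: accepts a URL if it contains a
// plain <whitelist> entry or fully matches a <whitelistexp> expression.
final class DualWhitelist {
    private final List<String> substrings;                        // <whitelist> entries
    private final List<Pattern> expressions = new ArrayList<>();  // <whitelistexp> entries

    DualWhitelist(List<String> substrings, List<String> regexps) {
        this.substrings = new ArrayList<>(substrings);
        for (String regexp : regexps) {
            expressions.add(Pattern.compile(regexp)); // compile once, reuse for every URL
        }
    }

    boolean accepts(String url) {
        for (String s : substrings) {
            if (url.indexOf(s) != -1) {
                return true; // current strategy: substring match
            }
        }
        for (Pattern p : expressions) {
            if (p.matcher(url).matches()) {
                return true; // regexp strategy: the whole URL must match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DualWhitelist w = new DualWhitelist(
            Arrays.asList("/videos/"), // assumed substring entry, for illustration only
            Arrays.asList("http://www.categorySite.es/setcionA(.)*sectionB(.)*"));
        System.out.println(w.accepts("http://www.misite.com/videos/a.html"));
        System.out.println(w.accepts("http://www.categorySite.es/setcionA/x/sectionB/y.html"));
        System.out.println(w.accepts("http://www.example.org/index.html"));
    }
}
```

Precompiling the expressions avoids recompiling each regexp for every
URL that is filtered.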


Thanks,
Ernesto.

Alvaro Cabrerizo wrote:
> You can use the existing "subcollection" plugin in Nutch 0.8.x and
> extend it to use regular expressions. Basically, you have to modify the
> class org.apache.nutch.collection.Subcollection: change the method
> filter (lines 146-154) and replace if(urlString.indexOf(row) != -1)
> with something like if(Pattern.matches(row, urlString)).
>
> This approach lets you:
>
> -Use the existing file subcollection.xml to define your
> url-expression/categories
> -Use the package java.util.regex to define matching urls
>
>
> Here is a sample of subcollection.xml, after modifying the
> subcollection plugin.
>
> <subcollection>
>                <name>myCategory</name>
>                <id>myCategory</id>
>                <whitelist>
>                        http://www.categorySite.es/setcionA(.)*sectionB(.)*
>                </whitelist>
>                <blacklist>
>                        http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*
>                </blacklist>
> </subcollection>
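
A rough sketch of how the modified filter could behave with the sample
above. It is not the real Subcollection class: the class name and the
rule that the blacklist is checked before the whitelist are assumptions.
One detail worth noting is that Pattern.matches() must match the whole
URL, which is why the expressions end in (.)*.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the modified filter method, not Nutch code.
final class RegexSubcollectionSketch {
    // Expressions copied from the sample subcollection.xml.
    static final String WHITE =
        "http://www.categorySite.es/setcionA(.)*sectionB(.)*";
    static final String BLACK =
        "http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*";

    // Returns the URL if it belongs to the subcollection, null otherwise.
    // Assumption: a blacklist hit always wins over a whitelist hit.
    static String filter(String urlString) {
        if (Pattern.matches(BLACK, urlString)) {
            return null;
        }
        if (Pattern.matches(WHITE, urlString)) {
            return urlString;
        }
        return null;
    }

    public static void main(String[] args) {
        String base = "http://www.categorySite.es/setcionA/news/";
        System.out.println(filter(base + "sectionB/page.html"));
        System.out.println(filter(base + "sectionB/sectionC/page.html"));
        // Without the trailing (.)* the whole-string match fails:
        System.out.println(Pattern.matches(
            "http://www.categorySite.es/setcionA(.)*sectionB",
            base + "sectionB/page.html"));
    }
}
```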
>
>
>
>
> 2006/10/14, Ernesto De Santis <[EMAIL PROTECTED]>:
>>
>> Hi Chad
>>
>> The link was a configuration example.
>>
>> A more detailed example:
>> http://www.misite.com/videos/.*=videos  (rule A)
>>
>> If the fetched URL matches rule A, then a field named 'category'
>> with the value 'videos' is indexed.
>>
>> Later you can search on this category field to filter your searches.
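
A minimal sketch of the rule lookup described here, under stated
assumptions: the class and method names are invented, and how the real
plugin reads its file is not shown in this thread. The sketch splits each
line at the last '=' instead of using java.util.Properties, because
Properties.load() treats an unescaped ':' as a key terminator, so a line
starting with http:// would get the key "http" unless it is written as
http\://.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical rule table, not the plugin's actual code.
final class UrlCategoryRules {
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> categories = new ArrayList<>();

    // Parses one line of the form "<regexp>=<category>".
    // Splitting at the LAST '=' keeps any '=' inside the regexp intact.
    void addRule(String line) {
        int eq = line.lastIndexOf('=');
        if (eq <= 0 || eq == line.length() - 1) {
            throw new IllegalArgumentException("Not a rule: " + line);
        }
        patterns.add(Pattern.compile(line.substring(0, eq).trim()));
        categories.add(line.substring(eq + 1).trim());
    }

    // Returns the category of the first rule whose regexp matches the
    // whole URL, or null if no rule matches.
    String categoryFor(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).matches()) {
                return categories.get(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UrlCategoryRules rules = new UrlCategoryRules();
        rules.addRule("http://www.misite.com/videos/.*=videos"); // rule A
        System.out.println(rules.categoryFor("http://www.misite.com/videos/clip1.html"));
        System.out.println(rules.categoryFor("http://www.misite.com/music/a.html"));
    }
}
```

An indexing filter could then add the returned value, when it is not
null, as the 'category' field of the document.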
>>
>> I will send this plugin in a separate new thread. I will post it
>> here, on the list; I don't know of another way to share it with you.
>>
>> Regards
>> Ernesto.
>>
>>
>>
>>
>>
>> [EMAIL PROTECTED] wrote:
>> > couldn't get the link to work but yes if you could share that would be
>> > great.
>> >
>> > Chad Savage
>> >
>> >
>> >
>> >
>> > Ernesto De Santis wrote:
>> >> I wrote a url-category-indexer.
>> >>
>> >> It works with a .properties file that maps URLs, written as regexps,
>> >> to categories.
>> >> Example:
>> >>
>> >> http://www.misite.com/videos/.*=videos
>> >>
>> >> If it seems useful, I can share it.
>> >>
>> >> Maybe it would be better to configure it in an .xml file.
>> >>
>> >> Regards,
>> >> Ernesto.
>> >>
>> >> Stefan Neufeind wrote:
>> >>> Alvaro Cabrerizo wrote:
>> >>>
>> >>>> Have you included a node describing your new searcher filter in
>> >>>> plugin.xml?
>> >>>>
>> >>>> 2006/10/11, xu nutch <[EMAIL PROTECTED]>:
>> >>>>
>> >>>>> I have a question about my plugin for the index filter and query filter.
>> >>>>> Can you help me?
>> >>>>> -------------------------------------
>> >>>>> In MoreIndexingFilter.java I added:
>> >>>>> doc.add(new Field("category", "test", false, true, false));
>> >>>>> -------------------------------------
>> >>>>>
>> >>>>> --------------------------------------
>> >>>>>
>> >>>>>
>> >>>>> package org.apache.nutch.searcher.more;
>> >>>>>
>> >>>>> import org.apache.nutch.searcher.RawFieldQueryFilter;
>> >>>>>
>> >>>>> /** Handles "category:" query clauses, causing them to search the
>> >>>>>  * "category" field indexed by MoreIndexingFilter. */
>> >>>>> public class CategoryQueryFilter extends RawFieldQueryFilter {
>> >>>>>  public CategoryQueryFilter() {
>> >>>>>    super("category");
>> >>>>>  }
>> >>>>> }
>> >>>>> -----------------------------------------------
>> >>>>> -----------------------------------------------
>> >>>>>
>> >>>>> <property>
>> >>>>>  <name>plugin.includes</name>
>> >>>>>  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
>> >>>>>  <description>Regular expression naming plugin directory names to
>> >>>>>  include.  Any plugin not matching this expression is excluded.
>> >>>>>  In any case you need at least include the nutch-extensionpoints
>> >>>>>  plugin. By default Nutch includes crawling just HTML and plain text
>> >>>>>  via HTTP, and basic indexing and search plugins.
>> >>>>>  </description>
>> >>>>> </property>
>> >>>>> -----------------------------------------------
>> >>>>>
>> >>>>> Querying "category:test" with Luke works fine,
>> >>>>> but the same query through the Tomcat web site
>> >>>>> returns no results.
>> >>>>>
>> >>>
>> >>> In case you get the search working:
>> >>> How do you plan to categorize URLs/sites? I'm looking for a solution
>> >>> there, since I haven't yet managed to implement something based on a
>> >>> URL prefix filter to map categories to URLs.
>> >>>
>> >>>
>> >>> Regards,
>> >>>  Stefan
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>>
>>
>>
>

        
        
                


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
