Hi Alvaro
Very good. It seems it could be extended to support both strategies, the
current one and regexp, perhaps by using another tag to define URL
expressions.
.....
<whitelistexp>
http://www.categorySite.es/setcionA(.)*sectionB(.)*
</whitelistexp>
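
To make the idea concrete, here is a minimal sketch of how both strategies could live side by side. It is not the actual Subcollection code: the class UrlMatchSketch, its method names, and the idea of feeding a separate list from a <whitelistexp> tag are assumptions for illustration only. The plain whitelist keeps the current substring check, and the expression list uses java.util.regex with whole-URL matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the real org.apache.nutch.collection.Subcollection.
class UrlMatchSketch {

    // Current strategy: a row matches if it occurs anywhere in the URL.
    static boolean matchesSubstring(String url, List<String> rows) {
        for (String row : rows) {
            if (url.indexOf(row) != -1) {
                return true;
            }
        }
        return false;
    }

    // Regexp strategy: the whole URL must match one of the expressions.
    static boolean matchesRegexp(String url, List<Pattern> expressions) {
        for (Pattern expression : expressions) {
            if (expression.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    // Accept the URL if either strategy matches (assumed combination rule).
    static boolean accept(String url, List<String> whitelist, List<Pattern> whitelistExp) {
        return matchesSubstring(url, whitelist) || matchesRegexp(url, whitelistExp);
    }

    public static void main(String[] args) {
        // Entries as they might come from <whitelist> (made-up value) and
        // from the proposed <whitelistexp> tag (expression from this mail).
        List<String> whitelist = Arrays.asList("http://www.categorySite.es/plain");
        List<Pattern> whitelistExp = Arrays.asList(
            Pattern.compile("http://www.categorySite.es/setcionA(.)*sectionB(.)*"));

        System.out.println(accept(
            "http://www.categorySite.es/plain/index.html", whitelist, whitelistExp));
        System.out.println(accept(
            "http://www.categorySite.es/setcionA/news/sectionB/page.html", whitelist, whitelistExp));
        System.out.println(accept(
            "http://www.other.es/", whitelist, whitelistExp));
    }
}
```

Compiling each expression once, when the configuration is read, avoids recompiling the pattern for every URL, which Pattern.matches(row, urlString) would do on each call.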
Thanks,
Ernesto.
Alvaro Cabrerizo wrote:
> You can use the existing "subcollection" plugin in Nutch 0.8.x and
> extend it to use regular expressions. Basically, you have to modify the
> class org.apache.nutch.collection.Subcollection: change the method
> filter (lines 146-154) and replace if(urlString.indexOf(row) != -1)
> with something like if(Pattern.matches(row, urlString)).
>
> This approach lets you:
>
> -Use the existing file subcollection.xml to define your
> url-expression/categories
> -Use the package java.util.regex to define matching urls
>
>
> Here is a sample of subcollection.xml, after modifying subcollection
> plugin.
>
> <subcollection>
> <name>myCategory</name>
> <id>myCategory</id>
> <whitelist>
>
> http://www.categorySite.es/setcionA(.)*sectionB(.)*
> </whitelist>
> <blacklist>
>
> http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*
> </blacklist>
> </subcollection>
>
>
>
>
> 2006/10/14, Ernesto De Santis <[EMAIL PROTECTED]>:
>>
>> Hi Chad
>>
>> The link was a configuration example.
>>
>> A more detailed example:
>> http://www.misite.com/videos/.*=videos (rule A)
>>
>> If the fetched URL matches rule A, a field named 'category' is indexed
>> with the value 'videos'.
>>
>> Later you can search on this category field to filter your searches.
>>
>> I will send this plugin in a new thread. I will post it here, on the
>> list, since I don't know of another way to share it with you.
>>
>> Regards
>> Ernesto.
>>
>>
>>
>>
>>
>> [EMAIL PROTECTED] wrote:
>> > I couldn't get the link to work, but yes, if you could share it that
>> > would be great.
>> >
>> > Chad Savage
>> >
>> >
>> >
>> >
>> > Ernesto De Santis wrote:
>> >> I wrote a url-category-indexer.
>> >>
>> >> It works with a .properties file that maps URLs, written as regexps,
>> >> to categories.
>> >> Example:
>> >>
>> >> http://www.misite.com/videos/.*=videos
>> >>
>> >> If it seems useful, I can share it.
>> >>
>> >> Maybe it would be better to configure it in an .xml file.
>> >>
>> >> Regards,
>> >> Ernesto.
>> >>
>> >> Stefan Neufeind wrote:
>> >>> Alvaro Cabrerizo wrote:
>> >>>
>> >>>> Have you included a node describing your new searcher filter in
>> >>>> plugin.xml?
>> >>>>
>> >>>> 2006/10/11, xu nutch <[EMAIL PROTECTED]>:
>> >>>>
>> >>>>> I have a question about my plugin for the index filter and query filter.
>> >>>>> Can you help me?
>> >>>>> -------------------------------------
>> >>>>> Added in MoreIndexingFilter.java:
>> >>>>> doc.add(new Field("category", "test", false, true, false));
>> >>>>> -------------------------------------
>> >>>>>
>> >>>>> --------------------------------------
>> >>>>>
>> >>>>>
>> >>>>> package org.apache.nutch.searcher.more;
>> >>>>>
>> >>>>> import org.apache.nutch.searcher.RawFieldQueryFilter;
>> >>>>>
>> >>>>> /** Handles "category:" query clauses, causing them to search the
>> >>>>>  * field indexed by BasicIndexingFilter. */
>> >>>>> public class CategoryQueryFilter extends RawFieldQueryFilter {
>> >>>>>   public CategoryQueryFilter() {
>> >>>>>     super("category");
>> >>>>>   }
>> >>>>> }
>> >>>>> -----------------------------------------------
>> >>>>> -----------------------------------------------
>> >>>>>
>> >>>>> <property>
>> >>>>> <name>plugin.includes</name>
>> >>>>> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
>> >>>>> <description>Regular expression naming plugin directory names to
>> >>>>> include. Any plugin not matching this expression is excluded.
>> >>>>> In any case you need at least include the nutch-extensionpoints
>> >>>>> plugin. By default Nutch includes crawling just HTML and plain text
>> >>>>> via HTTP, and basic indexing and search plugins.
>> >>>>> </description>
>> >>>>> </property>
>> >>>>> -----------------------------------------------
>> >>>>>
>> >>>>> Querying "category:test" with Luke works,
>> >>>>> but querying "category:test" through the Tomcat web site
>> >>>>> returns no results.
>> >>>>>
>> >>>
>> >>> In case you get the search working:
>> >>> How do you plan to categorize URLs/sites? I'm looking for a solution
>> >>> there, since I haven't yet managed to implement something based on a
>> >>> URL prefix filter to map categories to URLs.
>> >>>
>> >>>
>> >>> Regards,
>> >>> Stefan
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>>
>>
>>
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general