You can use the existing "subcollection" plugin in Nutch 0.8.x and extend it
to use regular expressions. Basically, you have to modify the class
org.apache.nutch.collection.Subcollection: change the method filter (lines
146-154) and replace if(urlString.indexOf(row) != -1) with something like
if(Pattern.matches(row, urlString)). Note that Pattern.matches only succeeds
when the expression matches the whole URL, not just a part of it.
This approach lets you:
- Use the existing file subcollection.xml to define your
  url-expression/categories
- Use the package java.util.regex to define the matching URLs
Here is a sample subcollection.xml, as it would look after modifying the
subcollection plugin:
<subcollection>
<name>myCategory</name>
<id>myCategory</id>
<whitelist>
http://www.categorySite.es/setcionA(.)*sectionB(.)*
</whitelist>
<blacklist>
http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*
</blacklist>
</subcollection>
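
To make the change concrete, here is a minimal sketch of the regex-based
whitelist/blacklist check, using the two expressions from the sample above.
The class RegexSubcollectionSketch and its helper methods are hypothetical
names made up for illustration; this is not the Nutch Subcollection class
itself, and it assumes a blacklist match always wins over a whitelist match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the modified check; this is NOT the Nutch
// Subcollection class, only the matching logic in isolation.
final class RegexSubcollectionSketch {
    private final List<Pattern> whiteList;
    private final List<Pattern> blackList;

    RegexSubcollectionSketch(List<String> whiteRows, List<String> blackRows) {
        this.whiteList = compile(whiteRows);
        this.blackList = compile(blackRows);
    }

    // Compile each non-empty row once instead of calling Pattern.matches per URL.
    private static List<Pattern> compile(List<String> rows) {
        List<Pattern> patterns = new ArrayList<>();
        for (String row : rows) {
            String trimmed = row.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    private static boolean matchesAny(List<Pattern> patterns, String urlString) {
        for (Pattern p : patterns) {
            // matches() must cover the whole URL, like Pattern.matches(row, urlString).
            if (p.matcher(urlString).matches()) {
                return true;
            }
        }
        return false;
    }

    // Returns the URL if it belongs to the subcollection, otherwise null.
    // Assumption: a blacklist match takes precedence over a whitelist match.
    String filter(String urlString) {
        if (matchesAny(blackList, urlString)) {
            return null;
        }
        return matchesAny(whiteList, urlString) ? urlString : null;
    }

    public static void main(String[] args) {
        RegexSubcollectionSketch sub = new RegexSubcollectionSketch(
            Arrays.asList("http://www.categorySite.es/setcionA(.)*sectionB(.)*"),
            Arrays.asList("http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*"));
        System.out.println(sub.filter("http://www.categorySite.es/setcionA/x/sectionB/page.html"));
        System.out.println(sub.filter("http://www.categorySite.es/setcionA/sectionB/sectionC/p.html"));
    }
}
```

With these two expressions, the first URL is returned unchanged and the
second one yields null, because it also matches the blacklist expression.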
2006/10/14, Ernesto De Santis <[EMAIL PROTECTED]>:
Hi Chad,
The link was a configuration example.
A more detailed example:
http://www.misite.com/videos/.*=videos (rule A)
If the fetched URL matches rule A, a field named 'category' is indexed
with the value 'videos'.
Later you can search on this category field to filter your searches.
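
As an illustration of rule A, here is a minimal sketch of how such rules
could be parsed and applied. It is not the plugin itself:
UrlCategoryRulesSketch, Rule and categoryFor are hypothetical names. It also
assumes the rules are read as plain text and split on the last '=', because
java.util.Properties.load also treats an unescaped ':' as a key/value
separator, which would cut a key like "http://..." at the colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not the actual plugin: maps URL regexps to categories.
final class UrlCategoryRulesSketch {
    private static final class Rule {
        final Pattern urlPattern;
        final String category;

        Rule(Pattern urlPattern, String category) {
            this.urlPattern = urlPattern;
            this.category = category;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses lines of the form <url-regexp>=<category>, splitting on the LAST '='
    // so that regexps containing ':' or '=' are kept intact.
    UrlCategoryRulesSketch(String rulesText) {
        for (String line : rulesText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int eq = trimmed.lastIndexOf('=');
            if (eq <= 0 || eq == trimmed.length() - 1) {
                continue; // skip malformed lines with no regexp or no category
            }
            rules.add(new Rule(Pattern.compile(trimmed.substring(0, eq).trim()),
                               trimmed.substring(eq + 1).trim()));
        }
    }

    // Returns the category of the first rule whose regexp matches the whole URL,
    // or null when no rule matches.
    String categoryFor(String url) {
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                return rule.category;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UrlCategoryRulesSketch rules =
            new UrlCategoryRulesSketch("http://www.misite.com/videos/.*=videos");
        System.out.println(rules.categoryFor("http://www.misite.com/videos/clip1.html"));
    }
}
```

In an indexing filter, the returned value would then be stored in the
'category' field of the document; a null result means no category is added.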
I will send this plugin in a new mail thread. I will post the plugin
here, on the list, since I don't know of another way to share it with you.
Regards
Ernesto.
[EMAIL PROTECTED] wrote:
> couldn't get the link to work but yes if you could share that would be
> great.
>
> Chad Savage
>
>
>
>
> Ernesto De Santis wrote:
>> I did a url-category-indexer.
>>
>> It works with a .properties file that maps URLs, written as regexps, to
>> categories.
>> example:
>>
>> http://www.misite.com/videos/.*=videos
>>
>> If it seems useful, I can share it.
>>
>> Maybe it would be better to configure it in an .xml file.
>>
>> Regards,
>> Ernesto.
>>
>> Stefan Neufeind wrote:
>>> Alvaro Cabrerizo wrote:
>>>
>>>> Have you included a node describing your new searcher filter in
>>>> plugin.xml?
>>>>
>>>> 2006/10/11, xu nutch <[EMAIL PROTECTED]>:
>>>>
>>>>> I have a question about my plugin for an index filter and a query filter.
>>>>> Can you help me?
>>>>> -------------------------------------
>>>>> Added in MoreIndexingFilter.java:
>>>>> doc.add(new Field("category", "test", false, true, false));
>>>>> -------------------------------------
>>>>>
>>>>> --------------------------------------
>>>>>
>>>>>
>>>>> package org.apache.nutch.searcher.more;
>>>>>
>>>>> import org.apache.nutch.searcher.RawFieldQueryFilter;
>>>>>
>>>>> /** Handles "category:" query clauses, causing them to search the
>>>>>  * "category" field added by MoreIndexingFilter. */
>>>>> public class CategoryQueryFilter extends RawFieldQueryFilter {
>>>>>   public CategoryQueryFilter() {
>>>>>     super("category");
>>>>>   }
>>>>> }
>>>>> -----------------------------------------------
>>>>> -----------------------------------------------
>>>>>
>>>>> <property>
>>>>> <name>plugin.includes</name>
>>>>> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
>>>>> <description>Regular expression naming plugin directory names to
>>>>> include. Any plugin not matching this expression is excluded.
>>>>> In any case you need at least include the nutch-extensionpoints
>>>>> plugin. By
>>>>> default Nutch includes crawling just HTML and plain text via HTTP,
>>>>> and basic indexing and search plugins.
>>>>> </description>
>>>>> </property>
>>>>> -----------------------------------------------
>>>>>
>>>>> Querying "category:test" with Luke works fine,
>>>>> but the same query through the Tomcat web site
>>>>> returns no results.
>>>>>
>>>
>>> In case you get the search working:
>>> How do you plan to categorize URLs/sites? I'm looking for a solution
>>> there, since I haven't yet managed to implement something
>>> URL-prefix-filter based to map categories to URLs.
>>>
>>>
>>> Regards,
>>> Stefan
>>>
>>>
>>>
>>
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general