Hi Alvaro
Very good. It seems the plugin can be extended to support both strategies,
the current one and regexp, maybe by using another tag to define URL
expressions:
.....
<whitelistexp>
http://www.categorySite.es/setcionA(.)*sectionB(.)*
</whitelistexp>
Thanks,
Ernesto.
Alvaro Cabrerizo wrote:
You can use the existing "subcollection" plugin in Nutch 0.8.x and extend it
to use regular expressions. Basically, you have to modify the class
org.apache.nutch.collection.Subcollection: change the method filter (lines
146-154) and replace if(urlString.indexOf(row) != -1) with something like
if(Pattern.matches(row, urlString)).
This approach lets you:
- Use the existing file subcollection.xml to define your URL
  expressions/categories
- Use the package java.util.regex to define matching URLs
Here is a sample subcollection.xml for use after modifying the subcollection
plugin:
<subcollection>
<name>myCategory</name>
<id>myCategory</id>
<whitelist>
http://www.categorySite.es/setcionA(.)*sectionB(.)*
</whitelist>
<blacklist>
http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*
</blacklist>
</subcollection>
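
To make the suggested change concrete, here is a minimal standalone sketch of
the matching logic. It is not the Nutch Subcollection class: the class name
RegexSubcollectionSketch and the methods addWhite/addBlack are hypothetical,
and the rule "accept a URL that matches some whitelist entry and no blacklist
entry" is an assumption about the intended behaviour. Keep in mind that
Pattern.matches requires the whole URL to match, whereas indexOf only looked
for a substring, so existing plain-text entries may need a trailing (.)*.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a regex-aware subcollection (not the Nutch class). */
public class RegexSubcollectionSketch {
    private final List<Pattern> whitelist = new ArrayList<>();
    private final List<Pattern> blacklist = new ArrayList<>();

    /** Compiles each expression once instead of on every filter call. */
    public void addWhite(String regex) { whitelist.add(Pattern.compile(regex)); }
    public void addBlack(String regex) { blacklist.add(Pattern.compile(regex)); }

    /**
     * Returns the URL if it fully matches a whitelist expression and no
     * blacklist expression; returns null otherwise (assumed semantics).
     */
    public String filter(String urlString) {
        for (Pattern p : blacklist) {
            if (p.matcher(urlString).matches()) {
                return null;
            }
        }
        for (Pattern p : whitelist) {
            if (p.matcher(urlString).matches()) {
                return urlString;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RegexSubcollectionSketch sc = new RegexSubcollectionSketch();
        sc.addWhite("http://www.categorySite.es/setcionA(.)*sectionB(.)*");
        sc.addBlack("http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*");
        // Whitelisted only: the URL is returned.
        System.out.println(sc.filter("http://www.categorySite.es/setcionA/x/sectionB/page.html"));
        // Also blacklisted: null is returned.
        System.out.println(sc.filter("http://www.categorySite.es/setcionA/sectionB/sectionC/y"));
    }
}
```

Compiling the expressions when the configuration is read avoids recompiling
them for every URL, which is what a direct Pattern.matches(row, urlString)
call inside the loop would do.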
2006/10/14, Ernesto De Santis <[EMAIL PROTECTED]>:
Hi Chad
The link was a configuration example.
Here is a more detailed example:
http://www.misite.com/videos/.*=videos (rule A)
If the fetched URL matches rule A, then a field named 'category' is
indexed with the value 'videos'.
Later you can search on this category field to filter your searches.
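
As an illustration of the rule format only, here is a small sketch of how such
regexp=category lines could be turned into a lookup. This is not the plugin
itself: the class UrlCategoryRulesSketch and its methods are hypothetical
names. The sketch parses each line by hand, splitting at the last '=',
because java.util.Properties treats an unescaped ':' (as in "http:") as the
end of the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical rule table mapping URL regexps to category names. */
public class UrlCategoryRulesSketch {

    /** One rule: a compiled URL expression and the category it assigns. */
    private static final class Rule {
        final Pattern pattern;
        final String category;
        Rule(Pattern pattern, String category) {
            this.pattern = pattern;
            this.category = category;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses a line of the form regexp=category, splitting at the last '='. */
    public void addRule(String line) {
        int eq = line.lastIndexOf('=');
        if (eq <= 0 || eq == line.length() - 1) {
            throw new IllegalArgumentException("Expected <regexp>=<category>: " + line);
        }
        String regex = line.substring(0, eq).trim();
        String category = line.substring(eq + 1).trim();
        rules.add(new Rule(Pattern.compile(regex), category));
    }

    /** Returns the category of the first rule the whole URL matches, or null. */
    public String categoryFor(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).matches()) {
                return r.category;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UrlCategoryRulesSketch table = new UrlCategoryRulesSketch();
        table.addRule("http://www.misite.com/videos/.*=videos"); // rule A
        System.out.println(table.categoryFor("http://www.misite.com/videos/clip1.html"));
    }
}
```

An indexing filter could then add the returned value as the 'category' field
whenever categoryFor gives a non-null result.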
I will send the plugin in a new mail thread. I will post it here, on the
list; I don't know another way to share it with you.
Regards
Ernesto.
[EMAIL PROTECTED] wrote:
> I couldn't get the link to work, but yes, if you could share it, that
> would be great.
>
> Chad Savage
>
>
>
>
> Ernesto De Santis wrote:
>> I wrote a url-category-indexer.
>>
>> It works with a .properties file that maps URLs, written as regexps, to
>> categories.
>> Example:
>>
>> http://www.misite.com/videos/.*=videos
>>
>> If it seems useful, I can share it.
>>
>> Maybe it would be better to configure it in an .xml file.
>>
>> Regards,
>> Ernesto.
>>
>> Stefan Neufeind wrote:
>>> Alvaro Cabrerizo wrote:
>>>
>>>> Have you included a node describing your new searcher filter in
>>>> plugin.xml?
>>>>
>>>> 2006/10/11, xu nutch <[EMAIL PROTECTED]>:
>>>>
>>>>> I have a question about my plugin for the index filter and query filter.
>>>>> Can you help me?
>>>>> -------------------------------------
>>>>> Added in MoreIndexingFilter.java:
>>>>> doc.add(new Field("category", "test", false, true, false));
>>>>> -------------------------------------
>>>>>
>>>>> --------------------------------------
>>>>>
>>>>>
>>>>> package org.apache.nutch.searcher.more;
>>>>>
>>>>> import org.apache.nutch.searcher.RawFieldQueryFilter;
>>>>>
>>>>> /** Handles "category:" query clauses, causing them to search the
>>>>> * "category" field indexed by MoreIndexingFilter. */
>>>>> public class CategoryQueryFilter extends RawFieldQueryFilter {
>>>>> public CategoryQueryFilter() {
>>>>> super("category");
>>>>> }
>>>>> }
>>>>> -----------------------------------------------
>>>>> -----------------------------------------------
>>>>>
>>>>> <property>
>>>>> <name>plugin.includes</name>
>>>>> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
>>>>> <description>Regular expression naming plugin directory names to
>>>>> include. Any plugin not matching this expression is excluded.
>>>>> In any case you need at least include the nutch-extensionpoints
>>>>> plugin. By default Nutch includes crawling just HTML and plain text via
>>>>> HTTP, and basic indexing and search plugins.
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> -----------------------------------------------
>>>>>
>>>>> Querying "category:test" with Luke works,
>>>>> but querying "category:test" through the Tomcat web site
>>>>> returns no results.
>>>>>
>>>
>>> In case you get the search working:
>>> How do you plan to categorize URLs/sites? I'm looking for a solution
>>> there, since I didn't yet manage to implement something
>>> URL-prefix-filter based to map categories to URLs or so.
>>>
>>>
>>> Regards,
>>> Stefan
>>>
>>>
>>>
>>
>>
>>
>