Re: I can not query myplugin in field category:test

Ernesto De Santis Mon, 16 Oct 2006 06:00:18 -0700

Hi Alvaro

Very good. It seems it can be extended to support both strategies, the current one and regexp.

Maybe by using another tag to define URL expressions:


               .....
              <whitelistexp>
                      http://www.categorySite.es/setcionA(.)*sectionB(.)*
              </whitelistexp>


Thanks,
Ernesto.

Alvaro Cabrerizo wrote:

You can use the existing "subcollection" plugin in Nutch 0.8.x and extend it
to use regular expressions. Basically you have to modify the class
org.apache.nutch.collection.Subcollection. Change the method filter (lines
146 to 154) and replace if(urlString.indexOf(row) != -1) with something like
if(Pattern.matches(row, urlString)).

This approach lets you:

- Use the existing file subcollection.xml to define your URL expressions/categories
- Use the package java.util.regex to define matching URLs
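
To make the difference between the two checks concrete, here is a minimal, standalone sketch. It is not the Nutch Subcollection code: the class UrlRowMatcher and its method names are hypothetical, and trimming each row is an assumption made because rows read from subcollection.xml may carry surrounding whitespace. Note that Pattern.matches only succeeds when the expression matches the whole URL, which is why the sample expressions end in (.)*.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: contrasts the two matching strategies. */
final class UrlRowMatcher {
    private UrlRowMatcher() {}

    /** Current strategy: the row matches if it occurs literally anywhere in the URL. */
    static boolean matchesSubstring(String row, String urlString) {
        return urlString.indexOf(row.trim()) != -1;
    }

    /** Proposed strategy: the row is a regular expression that must match the whole URL. */
    static boolean matchesRegex(String row, String urlString) {
        return Pattern.matches(row.trim(), urlString);
    }

    /** Returns true if any non-empty row matches the URL as a regular expression. */
    static boolean matchesAnyRegex(List<String> rows, String urlString) {
        for (String row : rows) {
            if (!row.trim().isEmpty() && matchesRegex(row, urlString)) {
                return true;
            }
        }
        return false;
    }
}
```

Pattern.matches recompiles the expression on every call, so for a large crawl it would probably be worth compiling each row once with Pattern.compile and reusing the result.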

Here is a sample of subcollection.xml, after modifying the subcollection plugin.


<subcollection>
               <name>myCategory</name>
               <id>myCategory</id>
               <whitelist>
                       http://www.categorySite.es/setcionA(.)*sectionB(.)*
               </whitelist>
               <blacklist>
                       http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*
               </blacklist>
</subcollection>




2006/10/14, Ernesto De Santis <[EMAIL PROTECTED]>:


Hi Chad

The link was a configuration example.

A more detailed example:
http://www.misite.com/videos/.*=videos  (rule A)

If the fetched URL matches rule A, then a field named 'category' is indexed
with the value 'videos'.

Later you can search on this category field to filter your searches.
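
As an illustration only, the rule lookup described above could be sketched like this. This is not the plugin itself: the class UrlCategoryRules and its methods are hypothetical. One assumption is worth stating loudly: the lines are parsed by hand and split on the last '=' (so category names must not contain '='), because java.util.Properties treats ':' as a key/value separator and would cut an unescaped key at "http".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of a regexp-to-category rule table; not the actual plugin. */
final class UrlCategoryRules {

    /** One rule: a compiled URL expression and the category it maps to. */
    private static final class Rule {
        final Pattern pattern;
        final String category;

        Rule(Pattern pattern, String category) {
            this.pattern = pattern;
            this.category = category;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    /** Parses lines of the form "urlRegexp=category"; blank lines and '#' comments are skipped. */
    UrlCategoryRules(Iterable<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Assumption: the category never contains '=', so the last '=' is the separator.
            int sep = trimmed.lastIndexOf('=');
            if (sep <= 0) {
                continue;
            }
            rules.add(new Rule(Pattern.compile(trimmed.substring(0, sep)),
                               trimmed.substring(sep + 1)));
        }
    }

    /** Returns the category of the first rule whose expression matches the whole URL, or null. */
    String categoryFor(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).matches()) {
                return rule.category;
            }
        }
        return null;
    }
}
```

An indexing filter could then call categoryFor on each fetched URL and, when the result is not null, add it to the document as the 'category' field.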

I will send this plugin in a separate new thread. I will post the plugin
here, on the list; I don't know another way to share it with you.

Regards
Ernesto.





[EMAIL PROTECTED] wrote:
> I couldn't get the link to work, but yes, if you could share it that would be
> great.
>
> Chad Savage
>
>
>
>
> Ernesto De Santis wrote:
>> I wrote a url-category-indexer.
>>
>> It works with a .properties file that maps URLs written as regexps to
>> categories.
>> Example:
>>
>> http://www.misite.com/videos/.*=videos
>>
>> If it seems useful, I can share it.
>>
>> Maybe it would be better to configure it in an .xml file.
>>
>> Regards,
>> Ernesto.
>>
>> Stefan Neufeind wrote:
>>> Alvaro Cabrerizo wrote:
>>>
>>>> Have you included a node to describe your new searcher filter into
>>>> plugin.xml?
>>>>
>>>> 2006/10/11, xu nutch <[EMAIL PROTECTED]>:
>>>>
>>>>> I have a question about my plugin for the index filter and query filter.
>>>>> Can you help me?
>>>>> -------------------------------------
>>>>> In MoreIndexingFilter.java I added:
>>>>> doc.add(new Field("category", "test", false, true, false));
>>>>> -------------------------------------
>>>>>
>>>>> --------------------------------------
>>>>>
>>>>>
>>>>> package org.apache.nutch.searcher.more;
>>>>>
>>>>> import org.apache.nutch.searcher.RawFieldQueryFilter;
>>>>>
>>>>> /** Handles "category:" query clauses, causing them to search the
>>>>>  * field indexed by MoreIndexingFilter. */
>>>>> public class CategoryQueryFilter extends RawFieldQueryFilter {
>>>>>  public CategoryQueryFilter() {
>>>>>    super("category");
>>>>>  }
>>>>> }
>>>>> -----------------------------------------------
>>>>> -----------------------------------------------
>>>>>
>>>>> <property>
>>>>>  <name>plugin.includes</name>
>>>>>
>>>>>  <description>Regular expression naming plugin directory names to
>>>>>  include.  Any plugin not matching this expression is excluded.
>>>>>  In any case you need at least include the nutch-extensionpoints
>>>>>  plugin. By default Nutch includes crawling just HTML and plain text
>>>>>  via HTTP, and basic indexing and search plugins.
>>>>>  </description>
>>>>> </property>
>>>>>
>>>>> -----------------------------------------------
>>>>>
>>>>> Querying "category:test" with Luke works fine,
>>>>> but querying "category:test" through the Tomcat website
>>>>> returns no results.
>>>>>
>>>
>>> In case you get the search working:
>>> How do you plan to categorize URLs/sites? I'm looking for a solution
>>> there, since I didn't yet manage to implement something
>>> URL-prefix-filter based to map categories to URLs or so.
>>>
>>>
>>> Regards,
>>>  Stefan
>>>
>>>
>>>
>>
>>
>>
>



