Not sure what you mean by a Pig script, but I'd like to be able to make a
multi-criteria selection of URLs for fetching...
The scoring method forces everything into a kind of mono-dimensional
approach, which is not really easy to deal with.
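To make that concrete, here is a tiny, purely illustrative Java sketch (it is
not Nutch code): once every criterion has to be folded into a single float,
independent policies like "matches my topic" and "sits at the frontier" can
only be combined by weighting, and the generator can no longer tell them apart:

// Illustrative only -- shows how several selection criteria get collapsed
// into one number when a single score is the only selection knob.
public class MonoDimensionalScore {

    static float score(boolean matchesTopicA, long daysSinceFetch, boolean isFrontier) {
        float s = 0f;
        if (matchesTopicA) s += 1.0f;                 // criterion 1: topic match
        s += Math.min(daysSinceFetch, 30) / 30.0f;    // criterion 2: staleness
        if (isFrontier) s += 0.5f;                    // criterion 3: never fetched
        return s;                                     // three dimensions squeezed into one
    }

    public static void main(String[] args) {
        // A stale non-topic page and a fresh topic-A page end up with the same
        // score, so the generator cannot apply a different policy to each.
        System.out.println(score(true, 0, false));   // 1.0
        System.out.println(score(false, 15, false)); // 1.0... wait, 0.5 -- see below
        System.out.println(score(false, 15, true));  // 1.0
    }
}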

The regex filters are good, but they assume you want to select URLs based on
data that is in the URL itself... which is pretty limited, in fact.
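For instance, a regex URL filter only ever sees the URL string, so nothing
known about the fetched content can influence the decision at that point.
A simplified stand-in in Java (not the actual RegexURLFilter implementation):

import java.util.regex.Pattern;

// Simplified stand-in for a regex URL filter: the only input is the URL string.
public class UrlOnlyFilter {

    private final Pattern accept = Pattern.compile(".*topic-a.*");

    // Returns the URL if it is accepted, null to drop it.
    // The page content is never visible here.
    public String filter(String url) {
        return accept.matcher(url).matches() ? url : null;
    }

    public static void main(String[] args) {
        UrlOnlyFilter f = new UrlOnlyFilter();
        System.out.println(f.filter("http://example.com/topic-a/page1")); // kept
        System.out.println(f.filter("http://example.com/page2"));         // dropped, even if the page is about topic A
    }
}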

Basically, I would like to do content-based crawling. Say, for example,
that I'm interested in "topic A".
I'd like to label URLs that match "topic A" (using user-supplied logic).
Later on, I would want to crawl the "topic A" URLs at a certain frequency,
and the unlabeled URLs in a different way, for exploration.
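A minimal sketch in Java of the kind of logic I have in mind (hypothetical
names, not an existing Nutch API): a parse-time step attaches a label to the
URL's metadata, and the generate step then picks a fetch interval per label
instead of relying on a single score:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: label URLs from parsed content, then let generation
// choose a re-fetch interval per label instead of using one global score.
public class LabelBasedGenerator {

    // Parse-time step (user-supplied logic): label the URL if its content matches topic A.
    static String label(String parsedText) {
        return parsedText.toLowerCase().contains("topic a") ? "topicA" : "unlabeled";
    }

    // Per-label re-fetch intervals: topic-A pages weekly, everything else monthly.
    static final Map<String, Long> INTERVAL_MS = new HashMap<String, Long>();
    static {
        INTERVAL_MS.put("topicA", TimeUnit.DAYS.toMillis(7));
        INTERVAL_MS.put("unlabeled", TimeUnit.DAYS.toMillis(30));
    }

    // Generate-time step: select a URL once its label-specific interval has elapsed.
    static boolean shouldFetch(String urlLabel, long lastFetchMs, long nowMs) {
        Long interval = INTERVAL_MS.get(urlLabel);
        if (interval == null) interval = TimeUnit.DAYS.toMillis(30); // default for unknown labels
        return nowMs - lastFetchMs >= interval;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long tenDaysAgo = now - TimeUnit.DAYS.toMillis(10);
        System.out.println(shouldFetch("topicA", tenDaysAgo, now));    // true  -> re-crawl weekly
        System.out.println(shouldFetch("unlabeled", tenDaysAgo, now)); // false -> leave for exploration
    }
}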

This looks hard to do right now.

2010/4/8, Doğacan Güney <doga...@gmail.com>:
> Hi,
>
> On Wed, Apr 7, 2010 at 21:19, MilleBii <mille...@gmail.com> wrote:
>> Just a question:
>> Will the new HBase implementation allow more sophisticated crawling
>> strategies than the current score-based one?
>>
>> To give you a few examples of what I'd like to do:
>> Define different crawling frequencies for different sets of URLs, say
>> weekly for some URLs, monthly or longer for others.
>>
>> Select URLs to re-crawl based on previously extracted attributes. Just one
>> example: re-crawl URLs that contained a certain keyword (or set of keywords).
>>
>> Select URLs that have not yet been crawled, i.e. the ones at the frontier
>> of the crawl.
>>
>
> At some point, it would be nice to change the generator so that it is only a
> handful of methods plus a pig (or something else) script. So, we would provide
> most of the functions you may need during generation (accessing various data),
> but the actual generation would be a pig process. This way, anyone can easily
> change the generate step any way they want (even split it into more than two
> jobs if they want more complex schemes).
>
>>
>>
>>
>> 2010/4/7, Doğacan Güney <doga...@gmail.com>:
>>> Hey everyone,
>>>
>>> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <a...@getopt.org> wrote:
>>>> On 2010-04-06 15:43, Julien Nioche wrote:
>>>>> Hi guys,
>>>>>
>>>>> I gather that we'll jump straight to 2.0 after 1.1, and that 2.0 will be
>>>>> based on what is currently referred to as NutchBase. Shall we create a
>>>>> branch for 2.0 in the Nutch SVN repository, with a corresponding label in
>>>>> JIRA so that we can file issues / feature requests on 2.0? Do you think
>>>>> that the current NutchBase could be used as a basis for the 2.0 branch?
>>>>
>>>> I'm not sure what the status of the nutchbase is - it has missed a lot of
>>>> fixes and changes in trunk since it was last touched...
>>>>
>>>
>>> I know... But I still intend to finish it, I just need to schedule
>>> some time for it.
>>>
>>> My vote would be to go with nutchbase.
>>>
>>>>>
>>>>> Talking about features, what else would we add apart from :
>>>>>
>>>>> * support for HBase : via ORM or not (see
>>>>> NUTCH-808<https://issues.apache.org/jira/browse/NUTCH-808>
>>>>> )
>>>>
>>>> This IMHO is promising, this could open the doors to small-to-medium
>>>> installations that are currently too cumbersome to handle.
>>>>
>>>
>>> Yeah, there is already a simple Avro-based ORM within nutchbase that should
>>> be generic enough to also support MySQL, Cassandra and BerkeleyDB. But any
>>> good ORM will be a very good addition.
>>>
>>>>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>>>>
>>>> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
>>>> different API) so that we can post-process the DOM created in Tika from
>>>> whatever original format.
>>>>
>>>> Also, the goal of the crawler-commons project is to provide APIs and
>>>> implementations of stuff that is needed for every open source crawler
>>>> project, like: robots handling, url filtering and url normalization, URL
>>>> state management, perhaps deduplication. We should coordinate our
>>>> efforts, and share code freely so that other projects (bixo, heritrix,
>>>> droids) may contribute to this shared pool of functionality, much like
>>>> Tika does for the common need of parsing complex formats.
>>>>
>>>>> * remove index / search and delegate to SOLR
>>>>
>>>> +1 - we may still keep a thin abstract layer to allow other
>>>> indexing/search backends, but the current mess of indexing/query filters
>>>> and competing indexing frameworks (lucene, fields, solr) should go away.
>>>> We should go directly from DOM to a NutchDocument, and stop there.
>>>>
>>>
>>> Agreed. I would like to add support for Katta and other indexing backends
>>> at some point, but NutchDocument should be our canonical representation.
>>> The rest should be up to the indexing backends.
>>>
>>>> Regarding search - currently the search API is too low-level, with the
>>>> custom text and query analysis chains. This needlessly introduces the
>>>> (in)famous Nutch Query classes and Nutch query syntax limitations. We
>>>> should get rid of it and simply leave this part of the processing to the
>>>> search backend. Probably we will use the SolrCloud branch that supports
>>>> sharding and global IDF.
>>>>
>>>>> * new functionalities e.g. sitemap support, canonical tag etc...
>>>>
>>>> Plus a better handling of redirects, detecting duplicated sites,
>>>> detection of spam cliques, tools to manage the webgraph, etc.
>>>>
>>>>>
>>>>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>>>>> update?
>>>>
>>>> Definitely. :)
>>>>
>>>> --
>>>> Best regards,
>>>> Andrzej Bialecki     <><
>>>>  ___. ___ ___ ___ _ _   __________________________________
>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Doğacan Güney
>>>
>>
>>
>> --
>> -MilleBii-
>>
>
>
>
> --
> Doğacan Güney
>


-- 
-MilleBii-
