[ 
https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alparslan Avcı updated NUTCH-1674:
----------------------------------

    Attachment: NUTCH-1674_2.patch

Nguyen's patch works well; however, I want to comment on a few points.
First, the patch filters on the WebPage.Field.BATCH_ID field. This filter
also returns rows that are not marked for generate, fetch, etc. I think the
FilterOp.EQUALS_IN_MAP operator can be used to filter by mark for each job.
For example, for FetcherJob, WebPage.Field.MARKERS can be used as the filter
field, with Mark.GENERATE_MARK and batchId as operands. This selects only the
rows that have Mark.GENERATE_MARK set and whose Mark.GENERATE_MARK value
equals batchId. Moreover, with this approach we can remove the
NutchJob.shouldProcess(mark, batchId) checks in the mappers.
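
The mark-based selection described above can be sketched in plain Java. This
is only a model of the intended semantics, not Gora or Nutch code: the class
MarkFilterSketch, its methods, and the "generate_mark" key are hypothetical
stand-ins for WebPage.Field.MARKERS and Mark.GENERATE_MARK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: each row is reduced to its markers map (mark name -> batchId).
class MarkFilterSketch {

    // "Equals in map": the mark must be present AND its value must equal batchId.
    static boolean equalsInMap(Map<String, String> markers, String markKey, String batchId) {
        String value = markers.get(markKey);
        return value != null && value.equals(batchId);
    }

    // Returns the keys of the rows that pass the filter, in key order.
    static List<String> select(Map<String, Map<String, String>> rows,
                               String markKey, String batchId) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
            if (equalsInMap(e.getValue(), markKey, batchId)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> rows = new TreeMap<>();
        rows.put("url-a", Map.of("generate_mark", "batch-1")); // marked for this batch
        rows.put("url-b", Map.of("generate_mark", "batch-2")); // marked for another batch
        rows.put("url-c", Map.of());                           // not marked at all
        System.out.println(select(rows, "generate_mark", "batch-1")); // [url-a]
    }
}
```

Rows without the mark are rejected by the filter itself, which is the check
the mappers would otherwise have to repeat.
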
Secondly, filters other than batchId may be implemented in the future. For
extensibility, I think the filters should be built in each job class, not in
StorageUtils. Since it is a utility class, StorageUtils should only apply the
filter (or filter set) that is passed to it as a method parameter.
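
A minimal sketch of that split of responsibilities, again in plain Java rather
than the real Nutch classes. FilterPerJobSketch, scan, fetcherFilter and
parserFilter are hypothetical names, and which mark each job checks is an
assumption made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

class FilterPerJobSketch {

    // Utility role: applies whatever filter it is handed and knows nothing about jobs.
    static List<String> scan(Map<String, Map<String, String>> rows,
                             Predicate<Map<String, String>> filter) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
            if (filter.test(e.getValue())) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    // Job role: each job builds the filter it needs (mark names are placeholders).
    static Predicate<Map<String, String>> fetcherFilter(String batchId) {
        return markers -> batchId.equals(markers.get("generate_mark"));
    }

    static Predicate<Map<String, String>> parserFilter(String batchId) {
        return markers -> batchId.equals(markers.get("fetch_mark"));
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> rows = new TreeMap<>();
        rows.put("url-a", Map.of("generate_mark", "batch-1"));
        rows.put("url-b", Map.of("generate_mark", "batch-1", "fetch_mark", "batch-1"));
        rows.put("url-c", Map.of("generate_mark", "batch-2"));
        System.out.println(scan(rows, fetcherFilter("batch-1"))); // [url-a, url-b]
        System.out.println(scan(rows, parserFilter("batch-1")));  // [url-b]
    }
}
```

Adding a new kind of filter then touches only the job that needs it; the scan
helper stays unchanged.
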
In the patch I attached, I applied the currently possible filters (only
batchId filters for now) to the jobs. Once the new HBase filters and filter
sets are implemented in Gora, we can add new filters (e.g. a filter on the
non-existence of Mark.FETCH_MARK for FetcherJob) and remove some checks from
the map functions.
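
As a rough illustration of what such a filter set could express, the sketch
below combines "generate mark equals batchId" with "fetch mark is absent"
using java.util.function.Predicate. It does not use the Gora filter API;
FilterSetSketch, markEquals, markAbsent and the mark names are hypothetical.

```java
import java.util.Map;
import java.util.function.Predicate;

class FilterSetSketch {

    // Passes when the mark exists and its value equals batchId.
    static Predicate<Map<String, String>> markEquals(String markKey, String batchId) {
        return markers -> batchId.equals(markers.get(markKey));
    }

    // Passes when the mark is not present on the row.
    static Predicate<Map<String, String>> markAbsent(String markKey) {
        return markers -> !markers.containsKey(markKey);
    }

    // Filter set with AND semantics: generated in this batch and not yet fetched.
    static Predicate<Map<String, String>> fetcherFilterSet(String batchId) {
        return markEquals("generate_mark", batchId).and(markAbsent("fetch_mark"));
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> filter = fetcherFilterSet("batch-1");
        // Generated in batch-1, not fetched yet: selected.
        System.out.println(filter.test(Map.of("generate_mark", "batch-1"))); // true
        // Generated in batch-1 but already fetched: skipped.
        System.out.println(filter.test(
            Map.of("generate_mark", "batch-1", "fetch_mark", "batch-1")));  // false
    }
}
```
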

> Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-1674
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1674
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3
>            Reporter: Nguyen Manh Tien
>             Fix For: 2.3
>
>         Attachments: NUTCH-1674.patch, NUTCH-1674_2.patch
>
>
> Nutch always scans the whole crawldb in each phase (generate, fetch, parse,
> update, index). When the crawldb is big, scanning takes longer than the
> actual processing.
> We really need to skip records while scanning, using GORA-119; for example,
> we can fetch only the records that belong to a specified batchId.
> In my crawl, the filter reduced the scan time from 90 min to 30 min.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
