[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

JIRA Mon, 14 May 2007 10:53:53 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495696
 ]


Doğacan Güney commented on NUTCH-443:
-------------------------------------

I am not sure I follow you, Andrzej. My patch already does a very similar thing 
in the Fetchers. Actually, the only difference between our patches, w.r.t. the 
Fetcher code, is that in your patch the parsing condition also includes a 
(content != null) check. Beyond that, our code is pretty much the same. (I will 
send an updated patch that does that, btw.) Besides the code change in the 
Fetchers, we also need to remove the redir != null condition for the indexer to 
work correctly. See my comment above for a hopefully more understandable 
description.

Indexer has to read crawl_parse, because that is where ParseSegment pushes the 
fetch datums for sub-urls. So it is not related to the redirection issue. It is 
related to the "Oh man, I forgot to include that line in my patch" issue :).

If reading crawl_parse turns out to be a big burden on Indexer, perhaps we can 
make ParseSegment push these datums to another file (crawl_late_fetch? Yeah, 
I know that name sucks :). It would be awesome if Hadoop allowed us to reopen 
SequenceFiles to append data (so we could have just pushed them to crawl_fetch). 
AFAIK, Hadoop doesn't have that yet.




> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
> NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, 
> patch.txt, redirect_and_index.patch
>
>
> Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
> can return multiple Parse objects, which will all be indexed separately. 
> Advantage: no need to fetch all feed items separately.
> See the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
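
To make the proposal in the issue description concrete, here is a minimal 
sketch of a parser that returns a Map<String,Parse> with one entry per feed 
item. The Parse and FeedItem classes and the parseFeed method are hypothetical 
stand-ins made up for illustration. They are not the real Nutch types, and 
the attached patches may well structure this differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every type here is a hypothetical stand-in,
// not the actual Nutch Parse/Parser API.
final class MultiParseSketch {

    /** Minimal stand-in for a parse result: a title plus extracted text. */
    static final class Parse {
        final String title;
        final String text;

        Parse(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /** Hypothetical feed item, as it might be extracted from an RSS document. */
    static final class FeedItem {
        final String link;
        final String title;
        final String description;

        FeedItem(String link, String title, String description) {
            this.link = link;
            this.title = title;
            this.description = description;
        }
    }

    /**
     * Returns one Parse per feed item, keyed by the item's URL, so that each
     * item can be indexed separately without fetching its URL on its own.
     */
    static Map<String, Parse> parseFeed(List<FeedItem> items) {
        // LinkedHashMap preserves the order in which items appear in the feed.
        Map<String, Parse> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            parses.put(item.link, new Parse(item.title, item.description));
        }
        return parses;
    }

    public static void main(String[] args) {
        List<FeedItem> items = Arrays.asList(
            new FeedItem("http://example.com/a", "First", "Text of the first item"),
            new FeedItem("http://example.com/b", "Second", "Text of the second item"));

        for (Map.Entry<String, Parse> e : parseFeed(items).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().title);
        }
    }
}
```

Keying the map by URL is what would let a downstream step treat every entry 
as its own document. How those entries reach the indexer is exactly the 
crawl_parse question discussed in the comment above.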

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

