[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

JIRA Wed, 28 Feb 2007 07:29:23 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476611 ]


Doğacan Güney commented on NUTCH-443:
-------------------------------------

> * you create the "fake" CrawlDatum-s in ParseOutputFormat, and then set 
> fetchTime to the current time. This is incorrect - 
> parsing may have been performed long after the content was fetched. The 
> correct place to create and store these "fake" 
> CrawlDatum-s is in the FetcherThread.output(), where you loop through 
> Entry<Text, Parse>, i.e.:

What if I run my fetcher in non-parsing mode (which, coincidentally, is what I 
always do)? I can add the code to the fetcher, but the fetch time will still be 
wrong when parsing runs as a separate step. I guess I will have to put 
FETCH_TIME_KEY back in. What do you think? Is there a better way to handle this?
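
The FETCH_TIME_KEY idea can be sketched roughly as follows: the fetch step 
records the fetch time in the content metadata, and the parse step reads it 
back instead of using the current time. This is only an illustration. A plain 
Map stands in for the metadata, and the class name, method names and key 
string are assumptions, not Nutch's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a plain Map plays the role of the content metadata.
final class FetchTimeSketch {

    // Assumed key string, for illustration only.
    static final String FETCH_TIME_KEY = "_ft_";

    // Fetch step: record when the content was actually fetched.
    static void recordFetchTime(Map<String, String> contentMeta, long fetchTimeMillis) {
        contentMeta.put(FETCH_TIME_KEY, Long.toString(fetchTimeMillis));
    }

    // Parse step: reuse the stored fetch time; use the fallback only if none was stored.
    static long fetchTimeForParse(Map<String, String> contentMeta, long fallbackMillis) {
        String stored = contentMeta.get(FETCH_TIME_KEY);
        return stored == null ? fallbackMillis : Long.parseLong(stored);
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        recordFetchTime(meta, 1172676563000L);
        // Parsing may run much later, yet it still sees the original fetch time.
        System.out.println(fetchTimeForParse(meta, System.currentTimeMillis()));
    }
}
```

With this shape, it does not matter whether parsing happens inside the fetcher 
or in a later job, because the time travels with the content.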

> * I'm pretty sure that ParseResult.filter() must NOT be called under normal 
> circumstances ... We need to store the information 
> that parsing was unsuccessful - if we remove this information from the 
> ParseResult we will never know that parsing failed for 
> this  content (or a part thereof). 

The current code does not store unsuccessful parses. Take ParseSegment: it 
only emits output if the parse status is success. So Nutch removes this 
information anyway; I just changed the place where it is removed. My approach 
is cleaner (IMO), but I don't feel that strongly about it, so I can change it. 
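
To make the point about dropping failed parses concrete, here is a minimal 
sketch of a filter step over a URL-to-parse map. SimpleParse, 
ParseFilterSketch and filter are hypothetical names, not Nutch classes, and 
the ParseResult.filter() in the patch may well differ.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; these are not Nutch classes.
final class ParseFilterSketch {

    // Minimal stand-in for a parse: its text plus a success flag.
    static final class SimpleParse {
        final String text;
        final boolean success;

        SimpleParse(String text, boolean success) {
            this.text = text;
            this.success = success;
        }
    }

    // Removes every entry whose parse failed and returns how many were removed.
    static int filter(Map<String, SimpleParse> parses) {
        int removed = 0;
        Iterator<Map.Entry<String, SimpleParse>> it = parses.entrySet().iterator();
        while (it.hasNext()) {
            if (!it.next().getValue().success) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, SimpleParse> parses = new LinkedHashMap<>();
        parses.put("http://example.com/feed#1", new SimpleParse("first item", true));
        parses.put("http://example.com/feed#2", new SimpleParse("", false));
        System.out.println("removed=" + filter(parses) + ", kept=" + parses.keySet());
    }
}
```

Whether such a filter runs before or after the output is written is exactly 
the open question: once it runs, the failure is no longer recorded anywhere.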

> * we have a backward-compatibility issue with ParseImpl.isFetched - i.e. data 
> created with earlier versions of Nutch won't be compatible with the new 
> format, and there is no versioning information in the already existing data. 
> We need to do one of the following:
>  - bite the bullet, and don't care about backward compatibility - not so nice 
> ... all existing segments will have to be re-parsed. Ouch.
>  - add look-ahead code to test the data coming from DataInput if it contains 
> this boolean flag or a likely Text length - somewhat unreliable...
>  - store this flag in ParseData.contentMeta - ugly hack. 

> Out of these three the last option seems the safest for now. From the 
> long-term point of view we should later on add 
> versioning information and handling of different versions in Parse. 

Parse (actually ParseImpl) is used as a temporary data structure to pass data 
from ParseSegment.map to ParseSegment.reduce (or Fetcher.something, but you get 
the point). So, unless someone stores the temporary outputs of ParseSegment.map 
and wants to reduce them with this patch, I don't see what can go wrong. 
ParseOutputFormat writes parse text and parse data, and doesn't care about what 
else is in there.
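
For the longer-term versioning suggestion quoted above, the usual pattern is 
to write a version byte first and branch on it when reading. The sketch below 
uses only java.io; the class name, field names and version numbers are 
assumptions for illustration, not the actual ParseImpl layout. Note that it 
only helps for data written after a version byte is introduced, not for 
existing unversioned data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of version-aware serialization; not the real ParseImpl.
final class VersionedParseSketch {

    static final byte CUR_VERSION = 2;

    String text = "";
    boolean isCanonical = true; // assumed to have been added in version 2

    // Always writes the current version first, then the fields.
    void write(DataOutput out) throws IOException {
        out.writeByte(CUR_VERSION);
        out.writeUTF(text);
        out.writeBoolean(isCanonical);
    }

    // Reads the version byte, then only the fields that version contains.
    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version < 1 || version > CUR_VERSION) {
            throw new IOException("Unsupported version: " + version);
        }
        text = in.readUTF();
        // Version 1 records have no flag, so fall back to the default.
        isCanonical = version >= 2 ? in.readBoolean() : true;
    }

    // Convenience helper: serialize to a byte array.
    static byte[] serialize(VersionedParseSketch p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper: deserialize from a byte array.
    static VersionedParseSketch deserialize(byte[] data) {
        try {
            VersionedParseSketch p = new VersionedParseSketch();
            p.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reader built this way can accept both old and new records, which avoids the 
look-ahead guessing described in the second option.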

> * the name of this method Parse.isFetched is somewhat misleading - it's not 
> about fetching or not, it's whether this Parse 
> corresponds to the original url or to a sub-url. Perhaps isCanonical, isRoot, 
> or some other name ...? 

Giving names to things is hard. Usually harder than creating them :). Will 
think of something here.

> * in ParseSegment - what's the reason for creating a new copy of ParseImpl in 
> this line below? I think we should store the one we already have in "parse":

That's because the return value of Parser.getParse is Parse, not ParseImpl, 
and Parse is not Writable. So I take the non-Writable Parse and create a 
Writable ParseImpl from it.

This is almost certainly not necessary, though. I will check this and update 
the patch.

> Thank you for your perseverance!

Sure, I just want to get this patch out of the way, so I can bug you all with 
my other patches :).

I will not send another patch, since I need some guidance on 1. I don't think 
that 2 and 3 are issues (but feel free to prove me wrong), and 4-5 are easy to 
solve.

Thanks again for your review.

> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
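
As a rough illustration of the feature described above, the sketch below 
turns one fetched feed into several entries keyed by item URL, so that each 
item can be handled separately without another fetch. It uses the JDK's DOM 
parser and maps each item link to its title. MultiParseSketch and parseFeed 
are hypothetical names, and a String stands in for a Parse object.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; a String stands in for a Parse object.
final class MultiParseSketch {

    // Returns one entry per feed item, keyed by the item's link URL.
    static Map<String, String> parseFeed(String feedXml) {
        Map<String, String> parses = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String link = childText(item, "link");
                if (link.isEmpty()) {
                    continue; // an item without a link has no URL to key on
                }
                parses.put(link, childText(item, "title"));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
        return parses;
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```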

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
