ParseData's contentMeta accumulates unnecessary values during parse
-------------------------------------------------------------------

                 Key: NUTCH-535
                 URL: https://issues.apache.org/jira/browse/NUTCH-535
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
            Assignee: Doğacan Güney
             Fix For: 1.0.0


After NUTCH-506, running parse on a segment causes ParseData's contentMeta 
to accumulate the metadata of every content parsed so far. This is because 
NUTCH-506 changed the constructor to create the metadata object (before 
NUTCH-506, a new metadata object was created on every call to readFields). 
It seems Hadoop somehow caches the Content instance, so each new call to 
Content.readFields during ParseSegment adds to the existing metadata instead 
of replacing it. Because of this, one can end up with a *huge* parse_data 
directory (something like 10 times larger than the content directory).
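
The accumulation described above can be reproduced in isolation. What follows 
is only a minimal sketch, not the Nutch code: Record is a hypothetical 
stand-in for Content, its metadata is a plain HashMap rather than Nutch's 
Metadata class, and the wire format (a count followed by key/value pairs) is 
made up for illustration. It assumes, as the report suspects, that one 
instance is reused for several readFields calls. Clearing the map at the 
start of readFields is shown as one possible remedy, not necessarily the fix 
that will be committed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class MetadataReuseSketch {

    /** Hypothetical stand-in for Content; not the real Nutch class. */
    static class Record {
        // Created once per instance, as in the post-NUTCH-506 constructor.
        private final Map<String, String> metadata = new HashMap<>();
        private final boolean clearOnRead;

        Record(boolean clearOnRead) {
            this.clearOnRead = clearOnRead;
        }

        /** Reads a count followed by that many key/value pairs (made-up format). */
        void readFields(DataInput in) throws IOException {
            if (clearOnRead) {
                metadata.clear(); // drop entries left over from the previous record
            }
            int entries = in.readInt();
            for (int i = 0; i < entries; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                metadata.put(key, value);
            }
        }

        int metadataSize() {
            return metadata.size();
        }
    }

    /** Serializes a single key/value pair in the made-up format above. */
    static byte[] serialize(String key, String value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(1);
        out.writeUTF(key);
        out.writeUTF(value);
        out.flush();
        return bytes.toByteArray();
    }

    /**
     * Feeds `records` one-entry records with distinct keys through a single
     * reused Record and returns how many metadata entries it holds afterwards.
     */
    static int metadataSizeAfter(boolean clearOnRead, int records) {
        try {
            Record reused = new Record(clearOnRead);
            for (int i = 0; i < records; i++) {
                byte[] data = serialize("key-" + i, "value-" + i);
                reused.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            }
            return reused.metadataSize();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + metadataSizeAfter(false, 3));
        System.out.println("with clear:    " + metadataSizeAfter(true, 3));
    }
}
```

Without the clear, the reused instance ends up holding one entry per record 
read so far, which is the growth seen in parse_data. With the clear, it holds 
only the entries of the most recent record.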



