[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092924#comment-15092924 ]

Sebastian Nagel commented on NUTCH-1712:

The merge is done, together with minor improvements (https://github.com/apache/nutch/compare/trunk...sebastian-nagel:NUTCH-1712), but the unit test (TestCrawlDbStates.java) still needs to be adapted.

> Use MultipleInputs in Injector to make it a single mapreduce job
>
>                 Key: NUTCH-1712
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1712
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the URLs from the seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both the crawldb and the output
> of the sort job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from the crawldb and
> URLs from the seeds file simultaneously and perform the inject in a single
> map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that
> unwanted records are ruled out quickly.
> 2. Migrated to the new mapreduce API
> 3. Improved documentation
> 4. New junits with better coverage
> Relevant discussion on nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
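The single-job idea in the issue description can be illustrated without a cluster. The following is only a rough sketch in plain Java, not the code from the patch: the class InjectSketch, its Datum stand-in for CrawlDatum, and the status strings are all hypothetical, and the merge rule (an existing crawldb entry wins over a freshly injected one) is an assumption made for illustration. On Hadoop itself, each input would be bound to its own mapper with MultipleInputs.addInputPath(job, path, inputFormatClass, mapperClass); here the two "mappers" simply write into one shared shuffle map that a single "reducer" then merges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, in-memory stand-in for a single inject job with two inputs. */
class InjectSketch {
  enum Source { CRAWLDB, SEED }

  /** Simplified stand-in for a CrawlDatum: a status plus the input it came from. */
  static final class Datum {
    final Source source;
    final String status;

    Datum(Source source, String status) {
      this.source = source;
      this.status = status;
    }
  }

  /** "Mapper" for the existing crawldb: passes each entry through, tagged CRAWLDB. */
  static void mapCrawlDb(Map<String, String> crawlDb, Map<String, List<Datum>> shuffle) {
    for (Map.Entry<String, String> e : crawlDb.entrySet()) {
      shuffle.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
          .add(new Datum(Source.CRAWLDB, e.getValue()));
    }
  }

  /** "Mapper" for the seeds file: one URL per non-blank line, tagged SEED. */
  static void mapSeeds(List<String> seedLines, Map<String, List<Datum>> shuffle) {
    for (String line : seedLines) {
      String url = line.trim();
      if (url.isEmpty()) {
        continue;
      }
      shuffle.computeIfAbsent(url, k -> new ArrayList<>())
          .add(new Datum(Source.SEED, "injected"));
    }
  }

  /** "Reducer": one output status per URL. Assumed rule: a crawldb entry wins. */
  static Map<String, String> reduce(Map<String, List<Datum>> shuffle) {
    Map<String, String> merged = new TreeMap<>();
    for (Map.Entry<String, List<Datum>> e : shuffle.entrySet()) {
      String status = "unfetched"; // assumed status for a newly injected URL
      for (Datum d : e.getValue()) {
        if (d.source == Source.CRAWLDB) {
          status = d.status;
          break;
        }
      }
      merged.put(e.getKey(), status);
    }
    return merged;
  }

  /** Both inputs feed the same shuffle, so one pass replaces the sort + merge jobs. */
  static Map<String, String> inject(Map<String, String> crawlDb, List<String> seedLines) {
    Map<String, List<Datum>> shuffle = new TreeMap<>();
    mapCrawlDb(crawlDb, shuffle);
    mapSeeds(seedLines, shuffle);
    return reduce(shuffle);
  }
}
```

Calling InjectSketch.inject(crawlDb, seedLines) with a crawldb map and a list of seed lines returns the merged URL-to-status map; URL filtering, normalization and seed metadata are left out of the sketch.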
[jira] [Assigned] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reassigned NUTCH-1712:

    Assignee: Sebastian Nagel  (was: Tejas Patil)
[jira] [Work started] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-1712 started by Sebastian Nagel.
[jira] [Commented] (NUTCH-2190) Protocol normalizer
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092381#comment-15092381 ]

Hudson commented on NUTCH-2190:

SUCCESS: Integrated in Nutch-trunk # (See [https://builds.apache.org/job/Nutch-trunk//])
NUTCH-2190 Protocol normalizer (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1724085])
* trunk/CHANGES.txt
* trunk/build.xml
* trunk/default.properties
* trunk/src/plugin/build.xml
* trunk/src/plugin/urlnormalizer-protocol
* trunk/src/plugin/urlnormalizer-protocol/build.xml
* trunk/src/plugin/urlnormalizer-protocol/data
* trunk/src/plugin/urlnormalizer-protocol/data/protocols.txt
* trunk/src/plugin/urlnormalizer-protocol/ivy.xml
* trunk/src/plugin/urlnormalizer-protocol/plugin.xml
* trunk/src/plugin/urlnormalizer-protocol/src
* trunk/src/plugin/urlnormalizer-protocol/src/java
* trunk/src/plugin/urlnormalizer-protocol/src/java/org
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java
* trunk/src/plugin/urlnormalizer-protocol/src/test
* trunk/src/plugin/urlnormalizer-protocol/src/test/org
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java

> Protocol normalizer
>
>                 Key: NUTCH-2190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2190
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2190.patch, NUTCH-2190.patch
>
>
> URL normalizer to normalize protocols for specified hosts/domains, e.g.
> normalizing http://www.apache.org/ to https://www.apache.org/
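The behaviour described in the issue summary (rewriting the protocol for specified hosts) might look roughly like the sketch below. It is not the committed ProtocolURLNormalizer: the class name ProtocolNormalizerSketch, the addRule method and the in-memory rule map are hypothetical, and the format of protocols.txt is not shown in this thread, so rules are added programmatically here. Only java.net.URI from the standard library is used.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a per-host protocol normalizer; not the Nutch plugin. */
class ProtocolNormalizerSketch {
  // host -> protocol that URLs on this host should use (illustrative rule store)
  private final Map<String, String> protocolByHost = new HashMap<>();

  void addRule(String host, String protocol) {
    protocolByHost.put(host.toLowerCase(Locale.ROOT), protocol.toLowerCase(Locale.ROOT));
  }

  /** Returns the URL with its scheme swapped if a rule exists for its host. */
  String normalize(String urlString) {
    URI uri;
    try {
      uri = URI.create(urlString);
    } catch (IllegalArgumentException e) {
      return urlString; // unparsable input is passed through unchanged
    }
    String scheme = uri.getScheme();
    String host = uri.getHost();
    if (scheme == null || host == null) {
      return urlString; // relative or host-less URLs are left alone
    }
    String wanted = protocolByHost.get(host.toLowerCase(Locale.ROOT));
    if (wanted == null || wanted.equalsIgnoreCase(scheme)) {
      return urlString; // no rule for this host, or already normalized
    }
    // The scheme is always the prefix of an absolute URL, so only it is replaced.
    return wanted + urlString.substring(scheme.length());
  }
}
```

With a rule added via addRule("www.apache.org", "https"), normalize("http://www.apache.org/") yields the https form from the example in the summary, while URLs on hosts without a rule are returned unchanged. The sketch does not adjust explicit port numbers, which a real normalizer would likely need to consider.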
[jira] [Resolved] (NUTCH-2190) Protocol normalizer
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-2190.

    Resolution: Fixed
      Assignee: Markus Jelsma

Committed revision 1724085.
[jira] [Updated] (NUTCH-2190) Protocol normalizer
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2190:

    Attachment: NUTCH-2190.patch

Final patch including all entries for build.xml and default.properties.