[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092924#comment-15092924
 ] 

Sebastian Nagel commented on NUTCH-1712:


The merging is done, together with minor improvements 
(https://github.com/apache/nutch/compare/trunk...sebastian-nagel:NUTCH-1712), 
but the unit test (TestCrawlDbStates.java) still needs to be adapted.


> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Affects Versions: 1.7
> Reporter: Tejas Patil
> Assignee: Sebastian Nagel
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E
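
The single-job idea in the description above can be illustrated with a minimal plain-Java simulation. This is only a sketch, not the code from the patch: it uses no Hadoop or Nutch classes, and every name in it (InjectSketch, Datum, mapSeeds, mapCrawlDb, reduce, inject) is hypothetical. It assumes a seed line is a URL optionally followed by tab-separated fields, and it assumes a merge policy in which an existing crawldb record wins over a newly injected one. Two "mappers" feed one shared shuffle, and a single "reducer" merges per URL, so no intermediate sort output is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Plain-Java simulation of a single-pass inject. No Hadoop or Nutch classes
// are used; every name below is hypothetical.
class InjectSketch {

    // Stand-in for a CrawlDatum, tagged with the input it came from.
    static final class Datum {
        final String status;
        final boolean fromSeeds;

        Datum(String status, boolean fromSeeds) {
            this.status = status;
            this.fromSeeds = fromSeeds;
        }
    }

    private static void emit(Map<String, List<Datum>> shuffle, String url, Datum datum) {
        shuffle.computeIfAbsent(url, k -> new ArrayList<>()).add(datum);
    }

    // "Mapper" for the seed list. The URL is filtered before anything else on
    // the line is looked at, so rejected lines cost no metadata parsing.
    static void mapSeeds(List<String> lines, Predicate<String> urlFilter,
                         Map<String, List<Datum>> shuffle) {
        for (String line : lines) {
            // Assumption: the URL is the first tab-separated field of the line.
            String url = line.trim().split("\t")[0];
            if (url.isEmpty() || !urlFilter.test(url)) {
                continue;
            }
            // Metadata extraction would happen here; it is omitted in this sketch.
            emit(shuffle, url, new Datum("injected", true));
        }
    }

    // "Mapper" for the existing crawldb: records pass through unchanged.
    static void mapCrawlDb(Map<String, String> crawlDb, Map<String, List<Datum>> shuffle) {
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            emit(shuffle, e.getKey(), new Datum(e.getValue(), false));
        }
    }

    // "Reducer": all records for one URL arrive together and are merged.
    // Assumed policy: an existing crawldb record wins over an injected one.
    static Map<String, String> reduce(Map<String, List<Datum>> shuffle) {
        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<Datum>> e : shuffle.entrySet()) {
            String existing = null;
            for (Datum d : e.getValue()) {
                if (!d.fromSeeds) {
                    existing = d.status;
                }
            }
            merged.put(e.getKey(), existing != null ? existing : "unfetched");
        }
        return merged;
    }

    // Both inputs are read in the same pass and merged by one reduce step.
    static Map<String, String> inject(List<String> seedLines, Predicate<String> urlFilter,
                                      Map<String, String> crawlDb) {
        Map<String, List<Datum>> shuffle = new TreeMap<>();
        mapSeeds(seedLines, urlFilter, shuffle);
        mapCrawlDb(crawlDb, shuffle);
        return reduce(shuffle);
    }
}
```

In an actual Hadoop job the two inputs would each be registered with their own input format and mapper class (for example through MultipleInputs.addInputPath), and the framework's shuffle would take the place of the in-memory map used here.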



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-11 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-1712:
--

Assignee: Sebastian Nagel  (was: Tejas Patil)



[jira] [Work started] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-11 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1712 started by Sebastian Nagel.
--


[jira] [Commented] (NUTCH-2190) Protocol normalizer

2016-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092381#comment-15092381
 ] 

Hudson commented on NUTCH-2190:
---

SUCCESS: Integrated in Nutch-trunk # (See 
[https://builds.apache.org/job/Nutch-trunk//])
NUTCH-2190 Protocol normalizer (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1724085])
* trunk/CHANGES.txt
* trunk/build.xml
* trunk/default.properties
* trunk/src/plugin/build.xml
* trunk/src/plugin/urlnormalizer-protocol
* trunk/src/plugin/urlnormalizer-protocol/build.xml
* trunk/src/plugin/urlnormalizer-protocol/data
* trunk/src/plugin/urlnormalizer-protocol/data/protocols.txt
* trunk/src/plugin/urlnormalizer-protocol/ivy.xml
* trunk/src/plugin/urlnormalizer-protocol/plugin.xml
* trunk/src/plugin/urlnormalizer-protocol/src
* trunk/src/plugin/urlnormalizer-protocol/src/java
* trunk/src/plugin/urlnormalizer-protocol/src/java/org
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol
* trunk/src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java
* trunk/src/plugin/urlnormalizer-protocol/src/test
* trunk/src/plugin/urlnormalizer-protocol/src/test/org
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol
* trunk/src/plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java


> Protocol normalizer
> ---
>
> Key: NUTCH-2190
> URL: https://issues.apache.org/jira/browse/NUTCH-2190
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb
> Affects Versions: 1.11
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2190.patch, NUTCH-2190.patch
>
>
> URL normalizer to normalize protocols for specified hosts/domains, e.g. 
> normalizing http://www.apache.org/ to https://www.apache.org/
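
The behaviour described above (a per-host protocol rewrite) might look roughly like the following sketch. It is not the committed ProtocolURLNormalizer: the class name ProtocolNormalizerSketch and the methods addRule and normalize are hypothetical, rules are added in code rather than read from the plugin's protocols.txt (whose layout is not shown in this thread), and only exact host matches are handled.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a per-host protocol normalizer; not the Nutch plugin code.
class ProtocolNormalizerSketch {

    // Lower-cased host name -> protocol that URLs on that host should use.
    private final Map<String, String> protocolByHost = new HashMap<>();

    void addRule(String host, String protocol) {
        protocolByHost.put(host.toLowerCase(Locale.ROOT), protocol);
    }

    // Returns the URL with its scheme replaced when a rule exists for its host;
    // URLs that cannot be parsed or have no matching rule are returned unchanged.
    String normalize(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url;
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return url;
        }
        String wanted = protocolByHost.get(host.toLowerCase(Locale.ROOT));
        if (wanted == null || wanted.equalsIgnoreCase(scheme)) {
            return url;
        }
        // Swap only the scheme prefix; host, path and query stay as they were.
        return wanted + url.substring(scheme.length());
    }
}
```

With a rule added via addRule("www.apache.org", "https"), normalize("http://www.apache.org/") returns "https://www.apache.org/", while URLs on other hosts pass through untouched. Explicit port numbers and domain-wide (suffix) rules are deliberately left out of this sketch.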



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2190) Protocol normalizer

2016-01-11 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2190.
--
Resolution: Fixed
  Assignee: Markus Jelsma

Committed revision 1724085.




[jira] [Updated] (NUTCH-2190) Protocol normalizer

2016-01-11 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2190:
-
Attachment: NUTCH-2190.patch

Final patch including all entries for build.xml and default.properties
