Re: Nutch 1.14 issues

2018-06-12 Thread Arkadi.Kosmynin
Hi Sebastian,

Sorry, clarifying my objectives:

I am not frustrated, just trying to help. I did not write this message to 
request fixes for Arch. All these issues have been fixed in Arch, except 
perhaps the native library issue, but I may fix it as well, if lucky enough. I 
wrote that message to contribute back to Nutch, because I consider these issues 
(at least, some of them) very important for Nutch.

I do understand that Nutch is supported by volunteers, and I really appreciate 
the work you are doing.

I will open JIRA issues.

Regards,

Arkadi   

From: Sebastian Nagel 
Sent: Wednesday, 13 June 2018 12:24 AM
To: dev@nutch.apache.org
Subject: Re: Nutch 1.14 issues

Hi Arkadi,

thanks for your feedback and suggestions.
I can understand your frustration but I also want to clarify:

- Arch is a nice project, for sure. But Arch is GPL-licensed,
  which makes contributions a one-way route (Nutch -> Arch)
  and means I don't even look into the Arch sources. Sorry.

- Please take the time to split your list of issues into separate
  requests on the mailing list or open separate Jira issues.
  Also take care that the problems are reproducible by sharing
  documents that failed to parse, log snippets, config files, etc.

- Sorry about NUTCH-2071, I took this mainly as a class path issue
  in the parse-tika plugin (which is solved). Now I understand better
  what your objective is, and I'll review and try to fix it
  (in combination with NUTCH-1993). But again: please take the time
  to explain your objectives, ping committers if fixes make no progress,
  etc.

- Nutch is a community project. There are no "paid" committers. This
  means that although some of us are paid to configure/operate/adapt
  crawlers, nobody is delegated to fix issues, support Nutch users, etc.
  That's voluntary work.

- Everybody is welcome to contribute (patches, documentation, support
  on the mailing list, etc.). Because Nutch is a small project, this
  will definitely help us.


Thanks,
Sebastian



On 06/12/2018 08:46 AM, arkadi.kosmy...@csiro.au wrote:
> Hi guys,
>
>
>
> I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to 
> Nutch 1.14 and Solr 7.2,
> and I have come across a few serious issues, of which you should be aware:
>
>
>
> 1.   NUTCH-2071 is still an issue in 1.14, because the returned 
> ParseResult is never null. If a parser fails to parse a document, it returns 
> an empty result, but not null. This means that, from a chain of parser 
> candidates, only the first one has a chance to try to parse the document.
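A minimal sketch of the chain behaviour described above, using hypothetical names (`ParserChain`, `ParseOutcome`, `tryParsers` are illustrative, not the actual Nutch API): treating an "empty but non-null" result the same as a failure lets the next candidate parser get its chance.

```java
import java.util.List;
import java.util.function.Function;

public class ParserChain {
    /** A minimal stand-in for Nutch's ParseResult. */
    public static final class ParseOutcome {
        public final String text;
        public ParseOutcome(String text) { this.text = text; }
        public boolean isEmpty() { return text == null || text.isEmpty(); }
    }

    /** Try each candidate parser until one produces a non-empty result. */
    public static ParseOutcome tryParsers(String doc,
            List<Function<String, ParseOutcome>> parsers) {
        for (Function<String, ParseOutcome> p : parsers) {
            ParseOutcome r = p.apply(doc);
            // Checking only "r != null" reproduces the bug described above:
            // the chain stops at the first parser even when its result is
            // empty. Also requiring a non-empty result lets later parsers run.
            if (r != null && !r.isEmpty()) {
                return r;
            }
        }
        return new ParseOutcome(null); // no parser succeeded
    }
}
```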
>
> 2.   Nutch adopted Tika as a general parsing tool and stopped supporting 
> the “legacy” parsing plugins (OO, MS). I kept using them and hoped to drop 
> them in the next version of Arch I am preparing for release, but I still 
> can’t, because Tika fails to parse too many documents on our site. When I 
> reinforce Tika with the legacy parsers, I achieve an almost 100% parsing 
> success rate. This is why NUTCH-2071 is important for Arch. I think you 
> should bring the legacy parsers back to Nutch, because the quality of 
> parsing of “real life” data, such as ours, is not great without them.
>
> 3.   The lines defining the fall-back (*) plugin in parse-plugins.xml are 
> not effective, because they are ignored as long as there is at least one 
> plugin claiming * in its plugin.xml file. In some cases, Nutch assigns the 
> * capability to plugins that don’t even claim it. For example, I can’t 
> understand why the Arch content-blocking plugin gets it.
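For reference, a fall-back mapping in parse-plugins.xml looks roughly like the fragment below (illustrative only; the plugin ids are examples). The report above says the `*` mapping here is ignored whenever some plugin's own plugin.xml already claims `*`:

```xml
<!-- Illustrative parse-plugins.xml fragment. The "*" mimeType is meant to
     be the fall-back used when no more specific mapping matches. -->
<parse-plugins>
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>
  <!-- intended fall-back for everything else -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
```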
>
> 4.   In earlier versions of Nutch, use of the native libraries really 
> helped: it reduced crawling of our site from a couple of days to 6-7 hours. 
> In Nutch 1.14, I don’t notice this. I’ve obtained the Hadoop native 
> libraries, placed them where they are expected, and even inserted an 
> explicit library-load call in my code, but I still don’t notice any 
> significant time savings.
>
> 5.   The feed plugin seems to have a major problem. Line 102 of 
> FeedIndexingFilter.java generated a NumberFormatException (which caused the 
> failure of the entire crawling process!) because it was trying to parse, as 
> a number, a date that was in string format. Given that this metadata piece 
> was generated by the feed parser (same plugin), the plugin seems to be in 
> disagreement with itself.
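A hedged sketch of a defensive fix for the kind of mismatch described above: accept the metadata value whether it is an epoch-millisecond number or a formatted date string, instead of letting `Long.parseLong` throw and abort the whole job. The class name, method name, and fallback date format here are illustrative, not the actual plugin code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class FeedDates {
    /** Returns epoch millis, or -1 when the value is unparseable. */
    public static long toEpochMillis(String value) {
        try {
            // The numeric form the indexing filter expects.
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            // Fall back to a string form the feed parser may have written
            // (RFC 822-style dates are common in feeds).
            SimpleDateFormat fmt =
                new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.ROOT);
            try {
                return fmt.parse(value.trim()).getTime();
            } catch (ParseException pe) {
                return -1; // unparseable: index without the date, don't crash
            }
        }
    }
}
```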
>
> 6.   This is less important, but when Tika fails to parse a document, it 
> generates a scary error message and an ugly stack trace. I think this should 
> be a one-line warning, because other parsers may still parse the document 
> successfully.
>
>
>
> Hope this helps.
>
>
>
> Regards,
>
>
>
> Arkadi
>



[jira] [Updated] (NUTCH-2147) MetadataScoringFilter for Nutch

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2147:
---
Fix Version/s: (was: 1.15)

> MetadataScoringFilter for Nutch
> ---
>
> Key: NUTCH-2147
> URL: https://issues.apache.org/jira/browse/NUTCH-2147
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, scoring
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> This issue originally started by envisioning an implementation of a 
> LanguagePreferenceScoringFilter so that Nutch could easily be made into a 
> directed crawler based on crawl administrators' ranking preferences for the 
> languages we wish to crawl. 
> Right now this is not possible.
> We already detect and index language within the language-identifier plugin, 
> as well as within parse-tika IIRC; however, currently the presence of a 
> language does not affect the scoring of pages.
> The scope of this issue has changed to make it more generally applicable for 
> a wider variety of use cases. This will therefore take advantage of 
> NUTCH-1980 by pulling (amongst other things) Language entries from the 
> CrawlDB Metadata.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2209) Improved Tokenization for Similarity Scoring plugin

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2209:
---
Fix Version/s: (was: 1.15)

> Improved Tokenization for Similarity Scoring plugin
> ---
>
> Key: NUTCH-2209
> URL: https://issues.apache.org/jira/browse/NUTCH-2209
> Project: Nutch
>  Issue Type: Improvement
>  Components: scoring
>Reporter: Sujen Shah
>Assignee: Sujen Shah
>Priority: Major
>  Labels: memex
>
> This patch would add Lucene-based tokenization to the cosine similarity 
> plugin and clean up the code currently present. 





[jira] [Updated] (NUTCH-2249) WordNet Integration for Cosine Similarity

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2249:
---
Fix Version/s: (was: 1.15)

> WordNet Integration for Cosine Similarity
> -
>
> Key: NUTCH-2249
> URL: https://issues.apache.org/jira/browse/NUTCH-2249
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, scoring
>Reporter: Bhavya Sanghavi
>Assignee: Sujen Shah
>Priority: Minor
>  Labels: memex
>
> Integrated WordNet database to enhance the cosine similarity plugin. 
> This helps in reducing the size of the vectors for calculating the cosine 
> similarity by mapping the synonymous words to the same entry in the vector. 
> Consequently, it would increase the accuracy of the scores given to the 
> webpages to be crawled. 





[jira] [Updated] (NUTCH-2265) Write A Test Package for Scoring Similarity

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2265:
---
Fix Version/s: (was: 1.15)

> Write A Test Package for Scoring Similarity
> ---
>
> Key: NUTCH-2265
> URL: https://issues.apache.org/jira/browse/NUTCH-2265
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, scoring
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
>Priority: Major
>
> There is no test package for org.apache.nutch.scoring.similarity; one 
> should be implemented.





[jira] [Commented] (NUTCH-2239) Selenium Handlers for Ajax Patterns from Student submissions

2018-06-12 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510090#comment-16510090
 ] 

Sebastian Nagel commented on NUTCH-2239:


Hi [~chrismattmann], still in progress?

> Selenium Handlers for Ajax Patterns from Student submissions
> 
>
> Key: NUTCH-2239
> URL: https://issues.apache.org/jira/browse/NUTCH-2239
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, protocol
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Assignee: Chris A. Mattmann
>Priority: Major
>  Labels: memex
>
> - Refactor student submissions from USC class of CSCI 572 to obtain a 
> comprehensive set of selenium handlers for various Ajax Patterns





[jira] [Updated] (NUTCH-2239) Selenium Handlers for Ajax Patterns from Student submissions

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2239:
---
Fix Version/s: (was: 1.15)

> Selenium Handlers for Ajax Patterns from Student submissions
> 
>
> Key: NUTCH-2239
> URL: https://issues.apache.org/jira/browse/NUTCH-2239
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, protocol
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Assignee: Chris A. Mattmann
>Priority: Major
>  Labels: memex
>
> - Refactor student submissions from USC class of CSCI 572 to obtain a 
> comprehensive set of selenium handlers for various Ajax Patterns





[jira] [Resolved] (NUTCH-2251) Make CommonCrawlFormatJackson instance reusable by properly handling object state

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2251.

   Resolution: Duplicate
Fix Version/s: (was: 1.15)

> Make CommonCrawlFormatJackson instance reusable by properly handling object 
> state
> -
>
> Key: NUTCH-2251
> URL: https://issues.apache.org/jira/browse/NUTCH-2251
> Project: Nutch
>  Issue Type: Sub-task
>  Components: commoncrawl
>Reporter: Thamme Gowda
>Priority: Major
>
> The class `CommonCrawlFormatJackson` keeps appending documents when it is 
> used to format more than one document. 
> This class should be modified to handle its state so that the same instance 
> can be used instead of creating a new one for each document being dumped.
> This suggestion was mentioned in the previous fix related to the format 
> issue: https://github.com/apache/nutch/pull/103
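An illustrative sketch of the state problem described above: a formatter that appends into a shared buffer keeps growing across documents, while resetting the buffer per call makes one instance reusable. The class and method names here are hypothetical, not the actual `CommonCrawlFormatJackson` code.

```java
public class ReusableFormatter {
    private final StringBuilder buffer = new StringBuilder();

    /** Format one document; safe to call repeatedly on the same instance. */
    public String format(String json) {
        // Reset state so output from earlier documents can't leak into this one.
        buffer.setLength(0);
        buffer.append("{\"doc\":").append(json).append("}");
        return buffer.toString();
    }
}
```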





[jira] [Commented] (NUTCH-2382) indexer-hbase Nutch 1.x branch

2018-06-12 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510086#comment-16510086
 ] 

Sebastian Nagel commented on NUTCH-2382:


After NUTCH-1480 the patch needs to be updated. Moving to 1.16.

> indexer-hbase Nutch 1.x branch
> --
>
> Key: NUTCH-2382
> URL: https://issues.apache.org/jira/browse/NUTCH-2382
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Major
> Fix For: 1.16
>
> Attachments: NUTCH-2382-indexer-hbase-p1.patch
>
>
> I've ported the indexer-hbase plugin for Nutch 2.x 
> (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. 
> Patch is attached.





[jira] [Updated] (NUTCH-2382) indexer-hbase Nutch 1.x branch

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2382:
---
Fix Version/s: (was: 1.15)
   1.16

> indexer-hbase Nutch 1.x branch
> --
>
> Key: NUTCH-2382
> URL: https://issues.apache.org/jira/browse/NUTCH-2382
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Major
> Fix For: 1.16
>
> Attachments: NUTCH-2382-indexer-hbase-p1.patch
>
>
> I've ported the indexer-hbase plugin for Nutch 2.x 
> (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. 
> Patch is attached.





[jira] [Resolved] (NUTCH-2312) Support PhantomJS as a WebDriver in protocol-selenium

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2312.

   Resolution: Incomplete
Fix Version/s: (was: 1.15)

No patch/PR provided so far.

> Support PhantomJS as a WebDriver in protocol-selenium
> -
>
> Key: NUTCH-2312
> URL: https://issues.apache.org/jira/browse/NUTCH-2312
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.12
>Reporter: Joey Hong
>Priority: Trivial
>  Labels: easyfix
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> PhantomJS is a great parallelizable and headless browser to work with Nutch 
> via protocol-selenium. It looks like the phantomjs JAR is already in the 
> dependencies, and an empty initialization for the PhantomJSDriver exists in 
> protocol-selenium source code.
> However, in its current state, protocol-selenium will not fetch any URLs 
> with PhantomJS, and configurations must be passed in via a 
> DesiredCapabilities object. Also, a parameter must be created to allow users 
> to set the path to their PhantomJS binary inside nutch-site.xml.





[jira] [Updated] (NUTCH-2267) Solr indexer fails at the end of the job with a java error message

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2267:
---
Fix Version/s: (was: 1.15)

> Solr indexer fails at the end of the job with a java error message
> --
>
> Key: NUTCH-2267
> URL: https://issues.apache.org/jira/browse/NUTCH-2267
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: hadoop v2.7.2  solr6 in cloud configuration with 
> zookeeper 3.4.6. I use the master branch from github currently on commit 
> da252eb7b3d2d7b70   ( NUTCH - 2263 mingram and maxgram support for Unigram 
> Cosine Similarity Model is provided. )
>Reporter: kaveh minooie
>Assignee: Lewis John McGibbney
>Priority: Major
>
> This is what I was getting first:
> 16/05/23 13:52:27 INFO mapreduce.Job:  map 100% reduce 100%
> 16/05/23 13:52:27 INFO mapreduce.Job: Task Id : 
> attempt_1462499602101_0119_r_00_0, Status : FAILED
> Error: Bad return type
> Exception Details:
>   Location:
> org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient;
>  @58: areturn
>   Reason:
> Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, 
> stack[0]) is not assignable to 
> 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
>   Current Frame:
> bci: @58
> flags: { }
> locals: { 'org/apache/solr/common/params/SolrParams', 
> 'org/apache/http/conn/ClientConnectionManager', 
> 'org/apache/solr/common/params/ModifiableSolrParams', 
> 'org/apache/http/impl/client/DefaultHttpClient' }
> stack: { 'org/apache/http/impl/client/DefaultHttpClient' }
>   Bytecode:
> 0x000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
> 0x010: 0099 001e b200 05bb 0007 59b7 0008 1209
> 0x020: b600 0a2c b600 0bb6 000c b900 0d02 002b
> 0x030: b800 104e 2d2c b800 0f2d b0
>   Stackmap Table:
> append_frame(@47,Object[#143])
> 16/05/23 13:52:28 INFO mapreduce.Job:  map 100% reduce 0% 
> As you can see, the failed reducer gets re-spawned. Then I found this issue: 
> https://issues.apache.org/jira/browse/SOLR-7657 and updated my Hadoop 
> config file. After that, the indexer seems to be able to finish (I got the 
> documents in Solr, it seems), but I still get the error message at the end 
> of the job:
> 16/05/23 16:39:26 INFO mapreduce.Job:  map 100% reduce 99%
> 16/05/23 16:39:44 INFO mapreduce.Job:  map 100% reduce 100%
> 16/05/23 16:39:57 INFO mapreduce.Job: Job job_1464045047943_0001 completed 
> successfully
> 16/05/23 16:39:58 INFO mapreduce.Job: Counters: 53
>   File System Counters
>   FILE: Number of bytes read=42700154855
>   FILE: Number of bytes written=70210771807
>   FILE: Number of read operations=0
>   FILE: Number of large read operations=0
>   FILE: Number of write operations=0
>   HDFS: Number of bytes read=8699202825
>   HDFS: Number of bytes written=0
>   HDFS: Number of read operations=537
>   HDFS: Number of large read operations=0
>   HDFS: Number of write operations=0
>   Job Counters 
>   Launched map tasks=134
>   Launched reduce tasks=1
>   Data-local map tasks=107
>   Rack-local map tasks=27
>   Total time spent by all maps in occupied slots (ms)=49377664
>   Total time spent by all reduces in occupied slots (ms)=32765064
>   Total time spent by all map tasks (ms)=3086104
>   Total time spent by all reduce tasks (ms)=1365211
>   Total vcore-milliseconds taken by all map tasks=3086104
>   Total vcore-milliseconds taken by all reduce tasks=1365211
>   Total megabyte-milliseconds taken by all map tasks=12640681984
>   Total megabyte-milliseconds taken by all reduce tasks=8387856384
>   Map-Reduce Framework
>   Map input records=25305474
>   Map output records=25305474
>   Map output bytes=27422869763
>   Map output materialized bytes=27489888004
>   Input split bytes=15225
>   Combine input records=0
>   Combine output records=0
>   Reduce input groups=16061459
>   Reduce shuffle bytes=27489888004
>   Reduce input records=25305474
>   Reduce output records=230
>   Spilled Records=54688613
>   Shuffled Maps =134
>   Failed Shuffles=0
>   Merged Map outputs=134
>   GC time elapsed (ms)=88103
>  

[jira] [Updated] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2369:
---
Fix Version/s: (was: 1.15)

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> --
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
>  Issue Type: Task
>  Components: crawldb, graphgenerator, hostdb, linkdb, segment, 
> storage, tool
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>  Labels: gsoc2017, gsoc2018
>
> I've been thinking for quite some time now that a new tool which writes 
> Nutch data out as full graph data would be an excellent addition to the 
> codebase.
> My thoughts involve using TinkerPop's ScriptInputFormat and 
> ScriptOutputFormat to create Vertex objects representing Nutch crawl 
> records. 
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, LinkDB, a 
> Segment, and possibly the HostDB in order to be fully populated. Graph 
> characteristics, e.g. Edges, would come from those existing data structures 
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have 
> already talked offline with a potential student [~omkar20895] about him 
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this 
> could be a game changer for how Nutch crawl data is interpreted. It is my 
> feeling that this issue most likely also involves an entire upgrade of the 
> Hadoop APIs from mapred to mapreduce for the master codebase.





[jira] [Commented] (NUTCH-2140) Atomic update and optimistic concurrency update using Solr

2018-06-12 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510075#comment-16510075
 ] 

Sebastian Nagel commented on NUTCH-2140:


Hi [~roannel], is this still a requirement or is it already done?

> Atomic update and optimistic concurrency update using Solr
> --
>
> Key: NUTCH-2140
> URL: https://issues.apache.org/jira/browse/NUTCH-2140
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.9
>Reporter: Roannel Fernández Hernández
>Priority: Major
> Fix For: 1.15
>
> Attachments: NUTCH-2140-v1.patch, NUTCH-2140-v2.patch
>
>
> The SOLRIndexWriter plugin allows indexing documents into a Solr server. 
> The plugin replaces documents that are already indexed in Solr. Sometimes it 
> is useful to replace only one field, or to add new fields and keep the other 
> values of the indexed documents.
> Solr supports two approaches for this task: atomic updates and optimistic 
> concurrency updates. However, the SOLRIndexWriter plugin doesn't support 
> these approaches.
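For context, a Solr atomic update request body looks roughly like the fragment below (illustrative field names and values): field modifiers such as `set` and `add` change only the named fields, and supplying `_version_` enables Solr's optimistic concurrency check, rejecting the update if the stored document has changed since that version was read.

```json
{
  "id": "http://example.com/page",
  "title": { "set": "New title" },
  "inlinks": { "add": "http://example.com/other" },
  "_version_": 1632740949182938112
}
```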





[jira] [Updated] (NUTCH-2032) Plugin to index the raw content of a readable document.

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2032:
---
Fix Version/s: (was: 1.15)

> Plugin to index the raw content of a readable document. 
> 
>
> Key: NUTCH-2032
> URL: https://issues.apache.org/jira/browse/NUTCH-2032
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, parser
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>Priority: Major
>  Labels: content, index, index-rawcontent, parser, raw
>
> This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and 
> https://issues.apache.org/jira/browse/NUTCH-1458
> We created a couple plugins to index the raw content of readable documents. 
> If we include these plugins in the plugin chain we'll index the raw content 
> of a readable document, i.e. XML, HTML, CSV, TXT etc. The index-rawcontent 
> plugin is not designed to index binary files, however having the full content 
> of an HTML/XML or a CSV document is really critical for some of us.





[jira] [Commented] (NUTCH-2030) ParseZip plugin is not able to extract language from zip document,this could solve that problem.

2018-06-12 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510069#comment-16510069
 ] 

Sebastian Nagel commented on NUTCH-2030:


So, is this about parse-zip, or about making the "lang" field defined in the 
Solr schema.xml multi-valued?

> ParseZip plugin is not able to extract language from zip document,this could 
> solve that problem.
> 
>
> Key: NUTCH-2030
> URL: https://issues.apache.org/jira/browse/NUTCH-2030
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
> Environment: Linux Mint 17 qiana, 4 GB Ram,Core I3.
>Reporter: Eyeris Rodriguez Rueda
>Priority: Minor
> Fix For: 1.16
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Currently the parse-zip plugin doesn't extract the language from zip 
> documents; therefore the lang field is empty in Solr or Elasticsearch. If 
> the package (.zip) contains a list of documents, the lang field could be 
> multi-valued to support that list of languages. A simple change to the 
> parse-zip plugin could fix this problem. I will use the language identifier 
> class from Tika and analyze each document inside.





[jira] [Updated] (NUTCH-2334) Extension point for schedulers

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2334:
---
Fix Version/s: (was: 1.15)
   1.16

> Extension point for schedulers
> --
>
> Key: NUTCH-2334
> URL: https://issues.apache.org/jira/browse/NUTCH-2334
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.12
>Reporter: Roannel Fernández Hernández
>Priority: Minor
> Fix For: 1.16
>
>
> With an extension point for schedulers, users should be able to create 
> new schedulers that meet their own needs.





[jira] [Updated] (NUTCH-2030) ParseZip plugin is not able to extract language from zip document,this could solve that problem.

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2030:
---
Fix Version/s: (was: 1.15)
   1.16

> ParseZip plugin is not able to extract language from zip document,this could 
> solve that problem.
> 
>
> Key: NUTCH-2030
> URL: https://issues.apache.org/jira/browse/NUTCH-2030
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
> Environment: Linux Mint 17 qiana, 4 GB Ram,Core I3.
>Reporter: Eyeris Rodriguez Rueda
>Priority: Minor
> Fix For: 1.16
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Currently the parse-zip plugin doesn't extract the language from zip 
> documents; therefore the lang field is empty in Solr or Elasticsearch. If 
> the package (.zip) contains a list of documents, the lang field could be 
> multi-valued to support that list of languages. A simple change to the 
> parse-zip plugin could fix this problem. I will use the language identifier 
> class from Tika and analyze each document inside.





[jira] [Updated] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2292:
---
Fix Version/s: (was: 1.15)
   1.16

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
> Fix For: 1.16
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |--src: sources
>   |-- nutch-plugins
>   |-- pom.xml (POM)
>   |-- plugin1
>   ||-- pom.xml (JAR)
>   | .
>   |-- pluginN
>|-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins; 
> introduce another POM to break the cycle if required.
>  





[jira] [Resolved] (NUTCH-2564) protocol-http throws an error when the content-length header is not a number

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2564.

Resolution: Fixed

> protocol-http throws an error when the content-length header is not a number
> 
>
> Key: NUTCH-2564
> URL: https://issues.apache.org/jira/browse/NUTCH-2564
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Gerard Bouchar
>Priority: Major
>
> When a server sends an invalid Content-Length header (one that is not a 
> valid number) with a plain-text HTTP body, browsers simply ignore it, but 
> protocol-http has a strange approach: if the header consists only of 
> whitespace, it is ignored, but if it contains other characters, an error is 
> thrown, preventing us from doing anything with the page.
> It should simply ignore invalid Content-Length headers.
>  
> Relevant code: 
> [https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L354-L359]
>  
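A hedged sketch of the lenient behaviour the report asks for: an invalid Content-Length header is treated as unknown (here signalled by -1) rather than raised as an error, mirroring what browsers do. This is an illustration, not the actual Nutch `HttpResponse` code.

```java
public class ContentLength {
    /** Returns the parsed length, or -1 when the header is absent or invalid. */
    public static long parseOrIgnore(String header) {
        if (header == null) return -1;
        try {
            return Long.parseLong(header.trim());
        } catch (NumberFormatException e) {
            // Invalid value: fall back to reading until the connection
            // closes (or the configured size limit), instead of failing.
            return -1;
        }
    }
}
```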





[jira] [Resolved] (NUTCH-2560) protocol-http throws an error when an http header spans over multiple lines

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2560.

Resolution: Cannot Reproduce

Thanks, [~gbouchar]. There is now a unit test for multi-line headers. As said, 
multi-line headers work as expected if they follow the deprecated specs and 
continuation lines are indented. Please reopen with a concrete example if this 
is still a problem.
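The deprecated behaviour referred to above is header "folding" (obs-fold in RFC 7230 §3.2.4): a line beginning with a space or tab continues the previous header's value. A minimal sketch, not the actual protocol-http code:

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderFold {
    /** Join obs-fold continuation lines onto the preceding header line. */
    public static List<String> unfold(List<String> rawLines) {
        List<String> out = new ArrayList<>();
        for (String line : rawLines) {
            if (!out.isEmpty() && (line.startsWith(" ") || line.startsWith("\t"))) {
                // Continuation: append to the previous header, collapsing the
                // fold to a single space as the spec recommends.
                out.set(out.size() - 1, out.get(out.size() - 1) + " " + line.trim());
            } else {
                out.add(line);
            }
        }
        return out;
    }
}
```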

> protocol-http throws an error when an http header spans over multiple lines
> ---
>
> Key: NUTCH-2560
> URL: https://issues.apache.org/jira/browse/NUTCH-2560
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> Some servers invalidly send headers that span multiple lines. In that case, 
> browsers simply ignore the subsequent lines, but protocol-http throws an 
> error, thus preventing us from fetching the contents of the page.





[jira] [Updated] (NUTCH-2512) Nutch does not build under JDK9

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2512:
---
Fix Version/s: (was: 1.15)
   1.16

> Nutch does not build under JDK9
> ---
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
>  Issue Type: Bug
>  Components: build, injector
>Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as off 02/22/2018)
>Reporter: Ralf
>Priority: Major
> Fix For: 1.16
>
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>  
> When trying to build Nutch, Ant complains about missing Sonar files, then 
> exits with:
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>  
> After commenting out the "offending code", the build finishes, but the 
> resulting binary fails to function (as does the Apache compiled binary 
> distribution). Both exit with:
>  
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method 
> sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
>         at 
> org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
>         at org.apache.nutch.crawl.Injector.run(Injector.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>  
> Error running:
>   /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Failed with exit value 255.





[jira] [Resolved] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2549.

Resolution: Fixed

Thanks, [~gbouchar] for the careful analysis!

> protocol-http does not behave the same as browsers
> --
>
> Key: NUTCH-2549
> URL: https://issues.apache.org/jira/browse/NUTCH-2549
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
> Attachments: NUTCH-2549.patch
>
>
> We identified the following issues in protocol-http (a plugin implementing 
> the HTTP protocol):
>  * It fails if a URL's path does not start with '/'
>  ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers 
> correctly rewrite the URL as [http://news.fx678.com/?171], while Nutch tries 
> to send an invalid HTTP request starting with *GET ?171 HTTP/1.0*).
>  * It advertises its requests as being HTTP/1.0, but sends an 
> _Accept-Encoding_ request header, which is defined only in HTTP/1.1. This 
> confuses some web servers.
>  ** Example: 
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
>  * If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket as 
> soon as they can.
>  ** Example: [http://www.webarcelona.net/es/blog?page=2]
>  * Some servers invalidly send an HTTP body directly without a status line or 
> headers. Browsers handle that; protocol-http doesn't:
>  ** Example: [https://app.unitymedia.de/]
>  * Some servers invalidly add colons after the HTTP status code in the status 
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not 
> found_ for instance). Browsers can handle that.
>  * Some servers invalidly send headers that span over multiple lines. In that 
> case, browsers simply ignore the subsequent lines, but protocol-http throws 
> an error, thus preventing us from fetching the contents of the page.
>  * There is no limit over the size of the HTTP headers it reads. A bogus 
> server could send an infinite stream of different HTTP headers and cause the 
> fetcher to go out of memory, or send the same HTTP header repeatedly and 
> cause the fetcher to timeout.
>  * The same goes for the HTTP status line: no check is made concerning its 
> size.
>  * While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
> Additionally (and this concerns protocol-httpclient as well), 
> when reading HTTP headers, for each header, the SpellCheckedMetadata class 
> computes a Levenshtein distance between it and every known header in the 
> HttpHeaders interface. Not only is that slow, non-standard, and non-conformant 
> with browsers' behavior, but it also causes bugs and prevents us from accessing 
> the real headers sent by the HTTP server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a 
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
> it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) 
> tries to read the HTTP body as chunked, whereas it is not.
>  





[jira] [Resolved] (NUTCH-2563) HTTP header spellchecking issues

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2563.

Resolution: Fixed

> HTTP header spellchecking issues
> 
>
> Key: NUTCH-2563
> URL: https://issues.apache.org/jira/browse/NUTCH-2563
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> When reading HTTP headers, for each header, the 
> SpellCheckedMetadata class computes a Levenshtein distance between it and 
> every known header in the HttpHeaders interface. Not only is that slow, 
> non-standard, and non-conformant with browsers' behavior, but it also causes 
> bugs and prevents us from accessing the real headers sent by the HTTP 
> server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a 
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
> it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) 
> tries to read the HTTP body as chunked, whereas it is not.
> I personally think that HTTP header spell checking is a bad 
> idea, and that this logic should be completely removed. But if it were to be 
> kept, the threshold (SpellCheckedMetadata.TRESHOLD_DIVIDER) should be higher 
> (we internally set it to 5 as a temporary fix for this issue).
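To make the failure mode concrete, here is a plain Levenshtein distance (a simplified sketch, not the actual SpellCheckedMetadata code): the two distinct headers differ only by the "Client-" prefix, so any distance-with-threshold scheme risks conflating them.

```java
public class HeaderDistanceDemo {
  // Standard Levenshtein edit distance via dynamic programming,
  // using two rolling rows to keep memory at O(|b|).
  public static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
            prev[j - 1] + cost);
      }
      int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
  }
}
```

Here the distance between "Client-Transfer-Encoding" and "Transfer-Encoding" is only 7 (the dropped prefix), small relative to the header lengths, which is exactly why a genuine typo like "Content-Tpye" (distance 2) and a genuinely different header land in the same bucket.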





[jira] [Resolved] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2557.

Resolution: Fixed

Thanks, [~gbouchar] and [~omkar20895]!

> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> 
>
> Key: NUTCH-2557
> URL: https://issues.apache.org/jira/browse/NUTCH-2557
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket as 
> soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP 
> body containing invalidly gzip encoded contents. Browsers follow the 
> redirection, but nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should 
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try 
> parsing the body when the headers indicate a redirection.
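The suggested behavior can be sketched with two hypothetical helpers (not the committed patch): decide from the headers alone whether the body matters, and degrade body-decoding failures to null content instead of a fetch error.

```java
import java.util.function.Function;

public class RedirectTolerantFetch {
  // A 3xx status with a Location header can be followed without a body.
  public static boolean canSkipBody(int status, String locationHeader) {
    return status >= 300 && status < 400 && locationHeader != null;
  }

  // Decode the body, but return null on failure so the redirect survives.
  public static byte[] decodeOrNull(byte[] raw, Function<byte[], byte[]> decoder) {
    try {
      return decoder.apply(raw);
    } catch (RuntimeException e) {
      return null; // e.g. invalidly gzip-encoded content: drop body, keep headers
    }
  }
}
```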





[jira] [Resolved] (NUTCH-2561) protocol-http can be made to read arbitrarily large HTTP responses

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2561.

Resolution: Fixed

Thanks, [~gbouchar], esp. for the idea for the unit test server.

> protocol-http can be made to read arbitrarily large HTTP responses
> --
>
> Key: NUTCH-2561
> URL: https://issues.apache.org/jira/browse/NUTCH-2561
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Critical
> Fix For: 1.15
>
> Attachments: evilserver.py
>
>
> protocol-http limits the size of the HTTP response body. However
>  * There is no limit over the size of the HTTP headers it reads. A bogus 
> server could send an infinite stream of different HTTP headers and cause the 
> fetcher to go out of memory, or send the same HTTP header repeatedly and 
> cause the fetcher to timeout.
>  * The same goes for the HTTP status line: no check is made concerning its 
> size.
> This can be both a performance and a security problem.
> Attached is an example Python implementation of a server that makes 
> protocol-http receive huge amounts of data and use a lot of CPU (because of 
> NUTCH-2563), without being stopped by http.getTimeout() or 
> http.getMaxContent().
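One way to sketch the missing bound (illustrative only; the actual fix lives in HttpResponse.java): stop accumulating a status or header line once it exceeds a configured byte limit, instead of buffering unbounded server output.

```java
public class BoundedLineReader {
  // Return the first CRLF-terminated line starting at offset, or null if it
  // grows beyond maxBytes -- a bogus server is then rejected early instead
  // of exhausting memory or running until the socket timeout.
  public static String readLine(byte[] data, int offset, int maxBytes) {
    StringBuilder sb = new StringBuilder();
    for (int i = offset; i < data.length; i++) {
      char c = (char) (data[i] & 0xff);
      if (c == '\n') return sb.toString();
      if (c != '\r') sb.append(c);
      if (sb.length() > maxBytes) return null; // over limit: give up
    }
    return sb.toString();
  }
}
```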





[jira] [Resolved] (NUTCH-2559) protocol-http cannot handle colons after the HTTP status code

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2559.

Resolution: Fixed

> protocol-http cannot handle colons after the HTTP status code
> -
>
> Key: NUTCH-2559
> URL: https://issues.apache.org/jira/browse/NUTCH-2559
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> Some servers invalidly add colons after the HTTP status code in the status 
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not 
> found_ for instance). Browsers can handle that.
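The tolerant parse can be sketched as follows (hypothetical helper, not the committed code): strip a trailing colon from the status-code token before converting it to a number.

```java
public class TolerantStatusLine {
  // Extract the status code from a status line, accepting a stray colon
  // after the code ("HTTP/1.1 404: Not found"); returns -1 if unparseable.
  public static int parseStatus(String statusLine) {
    String[] parts = statusLine.split(" ", 3);
    if (parts.length < 2) return -1;
    String code = parts[1];
    if (code.endsWith(":")) code = code.substring(0, code.length() - 1);
    try {
      return Integer.parseInt(code);
    } catch (NumberFormatException e) {
      return -1;
    }
  }
}
```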





[jira] [Resolved] (NUTCH-2558) protocol-http cannot handle a missing HTTP status line

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2558.

Resolution: Fixed

> protocol-http cannot handle a missing HTTP status line
> --
>
> Key: NUTCH-2558
> URL: https://issues.apache.org/jira/browse/NUTCH-2558
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> Some servers invalidly send an HTTP body directly without a status line or 
> headers. Browsers handle that; protocol-http doesn't:
>  * Example: [https://app.unitymedia.de/]
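Browsers effectively treat such a response as an HTTP/0.9-style body with an implied 200 OK. A minimal detection sketch (an assumption for illustration; the real fix in HttpResponse.java may differ):

```java
public class StatusLineDetector {
  // A well-formed response starts with "HTTP/"; anything else is taken
  // as a headerless body that should be parsed with an assumed 200 OK.
  public static boolean hasStatusLine(String firstLine) {
    return firstLine != null && firstLine.startsWith("HTTP/");
  }
}
```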





[jira] [Resolved] (NUTCH-2556) protocol-http makes invalid HTTP/1.0 requests

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2556.

Resolution: Fixed

HTTP/1.1 is now the default for protocol-http, but setting http.useHttp11 = 
false will make it send HTTP/1.0 requests.
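For reference, the switch mentioned above can be set in nutch-site.xml (property name as stated in this resolution; the value shown is the new default, and the description text is illustrative):

```xml
<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description>Use HTTP/1.1 requests in protocol-http; set to false
  to fall back to plain HTTP/1.0 requests.</description>
</property>
```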

> protocol-http makes invalid HTTP/1.0 requests
> -
>
> Key: NUTCH-2556
> URL: https://issues.apache.org/jira/browse/NUTCH-2556
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> protocol-http advertises its requests as being HTTP/1.0, but sends an 
> _Accept-Encoding_ request header, which is defined only in HTTP/1.1. This 
> confuses some web servers.
>  * Example: 
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]





[jira] [Resolved] (NUTCH-2555) URL normalization problem: path not starting with a '/'

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2555.

Resolution: Fixed

> URL normalization problem: path not starting with a '/'
> ---
>
> Key: NUTCH-2555
> URL: https://issues.apache.org/jira/browse/NUTCH-2555
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> When a URL does not have a path but has GET parameters (for instance 
> http://example.com?a=1), it should be 
> normalized to add a '/' at the beginning of the path (giving 
> http://example.com/?a=1). Our logs show that 
> non-normalized URLs reach protocol-http, which then uses URL::getFile() to 
> get the path, and tries to send an invalid HTTP request:
> GET ?a=1 HTTP/1.0
> instead of
> GET /?a=1 HTTP/1.0
>  
> Example URL for which this poses a problem: 
> [http://news.fx678.com?171|http://news.fx678.com/?171]
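The normalization amounts to one guard on the request target. A sketch assuming java.net.URL (which the issue itself references via URL::getFile()); the helper name is hypothetical:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RequestTargetNormalizer {
  // URL.getFile() returns "?171" for http://news.fx678.com?171; prefix a
  // '/' so the request line is "GET /?171 ..." instead of "GET ?171 ...".
  public static String requestTarget(String url) {
    try {
      String file = new URL(url).getFile();
      return file.startsWith("/") ? file : "/" + file;
    } catch (MalformedURLException e) {
      return null; // unparseable URL: nothing to normalize
    }
  }
}
```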





[jira] [Commented] (NUTCH-2558) protocol-http cannot handle a missing HTTP status line

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509869#comment-16509869
 ] 

Hudson commented on NUTCH-2558:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2558 protocol-http cannot handle a missing HTTP status line (snagel: 
[https://github.com/apache/nutch/commit/146a76ce000f2eba040806258414d9811ac40357])
* (edit) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> protocol-http cannot handle a missing HTTP status line
> --
>
> Key: NUTCH-2558
> URL: https://issues.apache.org/jira/browse/NUTCH-2558
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> Some servers invalidly send an HTTP body directly without a status line or 
> headers. Browsers handle that; protocol-http doesn't:
>  * Example: [https://app.unitymedia.de/]





[jira] [Commented] (NUTCH-2564) protocol-http throws an error when the content-length header is not a number

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509867#comment-16509867
 ] 

Hudson commented on NUTCH-2564:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2564 protocol-http throws an error when the content-length header 
(snagel: 
[https://github.com/apache/nutch/commit/957306ae0399442de655fc897ae82f4cb60a8883])
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (edit) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java


> protocol-http throws an error when the content-length header is not a number
> 
>
> Key: NUTCH-2564
> URL: https://issues.apache.org/jira/browse/NUTCH-2564
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Gerard Bouchar
>Priority: Major
>
> When a server sends an invalid Content-Length header (one that is not a valid 
> number) with a plain-text HTTP body, browsers simply ignore it, but 
> protocol-http has a strange approach: if the header consists only of 
> whitespace, it ignores it, but if it contains other characters, it throws an 
> error, preventing us from doing anything with the page.
> It should simply ignore invalid Content-Length headers.
>  
> Relevant code: 
> [https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L354-L359]
>  
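The suggested fix is a one-liner in spirit (hypothetical helper; the relevant HttpResponse code is linked above): treat any unparseable Content-Length value as absent.

```java
public class ContentLengthParser {
  // Returns the declared length, or -1 (unknown) when the header is
  // missing or not a valid number -- never throws, matching browsers.
  public static long parse(String headerValue) {
    if (headerValue == null) return -1;
    try {
      return Long.parseLong(headerValue.trim());
    } catch (NumberFormatException e) {
      return -1; // invalid Content-Length: ignore it
    }
  }
}
```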





[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509871#comment-16509871
 ] 

Hudson commented on NUTCH-2557:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2557 protocol-http fails to follow redirections when HTTP response 
(snagel: 
[https://github.com/apache/nutch/commit/d163512d5d2e345dfe6c816a29dc93a108dfd254])
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (edit) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java


> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> 
>
> Key: NUTCH-2557
> URL: https://issues.apache.org/jira/browse/NUTCH-2557
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket as 
> soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP 
> body containing invalidly gzip encoded contents. Browsers follow the 
> redirection, but nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should 
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try 
> parsing the body when the headers indicate a redirection.





[jira] [Commented] (NUTCH-2560) protocol-http throws an error when an http header spans over multiple lines

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509872#comment-16509872
 ] 

Hudson commented on NUTCH-2560:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2560 protocol-http throws an error when an http header spans over 
(snagel: 
[https://github.com/apache/nutch/commit/a2771dc0d1f551b8dd1e07609ce978251a05f34a])
* (edit) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java


> protocol-http throws an error when an http header spans over multiple lines
> ---
>
> Key: NUTCH-2560
> URL: https://issues.apache.org/jira/browse/NUTCH-2560
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> Some servers invalidly send headers that span over multiple lines. In that 
> case, browsers simply ignore the subsequent lines, but protocol-http throws 
> an error, thus preventing us from fetching the contents of the page.





[jira] [Commented] (NUTCH-2556) protocol-http makes invalid HTTP/1.0 requests

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509866#comment-16509866
 ] 

Hudson commented on NUTCH-2556:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2556 protocol-http makes invalid HTTP/1.0 requests - use HTTP/1.1 
(snagel: 
[https://github.com/apache/nutch/commit/73d082e3e3f32f71fd526c2e2084d601fa628d60])
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (edit) conf/nutch-default.xml


> protocol-http makes invalid HTTP/1.0 requests
> -
>
> Key: NUTCH-2556
> URL: https://issues.apache.org/jira/browse/NUTCH-2556
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> protocol-http advertises its requests as being HTTP/1.0, but sends an 
> _Accept-Encoding_ request header, which is defined only in HTTP/1.1. This 
> confuses some web servers.
>  * Example: 
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]





[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509864#comment-16509864
 ] 

Hudson commented on NUTCH-2549:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2549 protocol-http does not behave the same as browsers - add unit 
(snagel: 
[https://github.com/apache/nutch/commit/4cf96820553c7137236e52da0551b084814670f2])
* (edit) src/plugin/protocol-http/src/test/conf/nutch-site-test.xml
* (add) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java
NUTCH-2549 protocol-http does not behave the same as browsers - be (snagel: 
[https://github.com/apache/nutch/commit/2e485cfbdf46461a733cd21e9129f6fa5989f288])
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> protocol-http does not behave the same as browsers
> --
>
> Key: NUTCH-2549
> URL: https://issues.apache.org/jira/browse/NUTCH-2549
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
> Attachments: NUTCH-2549.patch
>
>
> We identified the following issues in protocol-http (a plugin implementing 
> the HTTP protocol):
>  * It fails if a URL's path does not start with '/'
>  ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers 
> correctly rewrite the URL as [http://news.fx678.com/?171], while Nutch tries 
> to send an invalid HTTP request starting with *GET ?171 HTTP/1.0*).
>  * It advertises its requests as being HTTP/1.0, but sends an 
> _Accept-Encoding_ request header, which is defined only in HTTP/1.1. This 
> confuses some web servers.
>  ** Example: 
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
>  * If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket as 
> soon as they can.
>  ** Example: [http://www.webarcelona.net/es/blog?page=2]
>  * Some servers invalidly send an HTTP body directly without a status line or 
> headers. Browsers handle that; protocol-http doesn't:
>  ** Example: [https://app.unitymedia.de/]
>  * Some servers invalidly add colons after the HTTP status code in the status 
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not 
> found_ for instance). Browsers can handle that.
>  * Some servers invalidly send headers that span over multiple lines. In that 
> case, browsers simply ignore the subsequent lines, but protocol-http throws 
> an error, thus preventing us from fetching the contents of the page.
>  * There is no limit over the size of the HTTP headers it reads. A bogus 
> server could send an infinite stream of different HTTP headers and cause the 
> fetcher to go out of memory, or send the same HTTP header repeatedly and 
> cause the fetcher to timeout.
>  * The same goes for the HTTP status line: no check is made concerning its 
> size.
>  * While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
> Additionally (and this concerns protocol-httpclient as well), 
> when reading HTTP headers, for each header, the SpellCheckedMetadata class 
> computes a Levenshtein distance between it and every known header in the 
> HttpHeaders interface. Not only is that slow, non-standard, and non-conformant 
> with browsers' behavior, but it also causes bugs and prevents us from accessing 
> the real headers sent by the HTTP server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a 
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
> it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) 
> tries to read the HTTP body as chunked, whereas it is not.
>  





[jira] [Commented] (NUTCH-2559) protocol-http cannot handle colons after the HTTP status code

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509868#comment-16509868
 ] 

Hudson commented on NUTCH-2559:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2559 protocol-http cannot handle colons after the HTTP status code 
(snagel: 
[https://github.com/apache/nutch/commit/9e212a2675234c981a28a96ffa093813ed4274f9])
* (edit) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> protocol-http cannot handle colons after the HTTP status code
> -
>
> Key: NUTCH-2559
> URL: https://issues.apache.org/jira/browse/NUTCH-2559
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> Some servers invalidly add colons after the HTTP status code in the status 
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not 
> found_ for instance). Browsers can handle that.





[jira] [Commented] (NUTCH-2563) HTTP header spellchecking issues

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509870#comment-16509870
 ] 

Hudson commented on NUTCH-2563:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2563 HTTP header spellchecking issues ("Client-Transfer-Encoding" 
(snagel: 
[https://github.com/apache/nutch/commit/381e82ff0a891d899ac8541d6a30f0d12633d247])
* (edit) src/java/org/apache/nutch/metadata/HttpHeaders.java
* (edit) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java
* (edit) src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java


> HTTP header spellchecking issues
> 
>
> Key: NUTCH-2563
> URL: https://issues.apache.org/jira/browse/NUTCH-2563
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> When reading HTTP headers, for each header, the 
> SpellCheckedMetadata class computes a Levenshtein distance between it and 
> every known header in the HttpHeaders interface. Not only is that slow, 
> non-standard, and non-conformant with browsers' behavior, but it also causes 
> bugs and prevents us from accessing the real headers sent by the HTTP 
> server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a 
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
> it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) 
> tries to read the HTTP body as chunked, whereas it is not.
> I personally think that HTTP header spell checking is a bad 
> idea, and that this logic should be completely removed. But if it were to be 
> kept, the threshold (SpellCheckedMetadata.TRESHOLD_DIVIDER) should be higher 
> (we internally set it to 5 as a temporary fix for this issue).





[jira] [Commented] (NUTCH-2555) URL normalization problem: path not starting with a '/'

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509865#comment-16509865
 ] 

Hudson commented on NUTCH-2555:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2555 URL normalization problem: path not starting with a '/' For (snagel: 
[https://github.com/apache/nutch/commit/6239655b6fd959b637ae3948f616f393aa99f159])
* (edit) 
src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
* (edit) 
src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java
* (edit) 
src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> URL normalization problem: path not starting with a '/'
> ---
>
> Key: NUTCH-2555
> URL: https://issues.apache.org/jira/browse/NUTCH-2555
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> When a URL does not have a path but has GET parameters (for instance 
> http://example.com?a=1), it should be 
> normalized to add a '/' at the beginning of the path (giving 
> http://example.com/?a=1). Our logs show that 
> non-normalized URLs reach protocol-http, which then uses URL::getFile() to 
> get the path, and tries to send an invalid HTTP request:
> GET ?a=1 HTTP/1.0
> instead of
> GET /?a=1 HTTP/1.0
>  
> Example URL for which this poses a problem: 
> [http://news.fx678.com?171|http://news.fx678.com/?171]
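The normalization the reporter asks for can be sketched as follows. This is an illustrative sketch only, not the committed BasicURLNormalizer patch: it detects an empty path via `java.net.URL` and reassembles the URL with a leading '/', so that `URL::getFile()` later yields `/?a=1` rather than `?a=1` in the request line.

```java
// Illustrative sketch (not the actual Nutch fix): add the missing '/' path.
import java.net.MalformedURLException;
import java.net.URL;

public class NormalizeSketch {

    static String normalize(String url) throws MalformedURLException {
        URL u = new URL(url);
        if (u.getPath().isEmpty()) {
            // No path component: rebuild with a leading '/' before the query.
            String query = (u.getQuery() != null) ? "?" + u.getQuery() : "";
            return u.getProtocol() + "://" + u.getAuthority() + "/" + query;
        }
        return url;
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(normalize("http://example.com?a=1"));    // http://example.com/?a=1
        System.out.println(normalize("http://news.fx678.com?171")); // http://news.fx678.com/?171
    }
}
```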



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2040) Upgrade to recent version of Crawler-Commons

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509848#comment-16509848
 ] 

Hudson commented on NUTCH-2040:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1612 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1612/])
NUTCH-2040 Upgrade to recent version of Crawler-Commons (snagel: 
[https://github.com/apache/nutch/commit/cd46e1b740d1086716f4f7de991f04ed3685a5b5])
* (edit) ivy/ivy.xml


> Upgrade to recent version of Crawler-Commons
> 
>
> Key: NUTCH-2040
> URL: https://issues.apache.org/jira/browse/NUTCH-2040
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.4
>
>
> [Crawler Commons 
> 0.6|https://github.com/crawler-commons/crawler-commons#11th-june-2015---crawler-commons-06-is-released]
>  was released. We should upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-06-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509804#comment-16509804
 ] 

ASF GitHub Bot commented on NUTCH-2549:
---

sebastian-nagel closed pull request #347: NUTCH-2549  protocol-http does not 
behave the same as browsers
URL: https://github.com/apache/nutch/pull/347
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/build.xml b/build.xml
index 1d680d0bd..d4836a4f2 100644
--- a/build.xml
+++ b/build.xml
@@ -215,6 +215,7 @@
   
   
   
+  
   
   
   
@@ -673,6 +674,7 @@
   
   
   
+  
   
   
   
@@ -1107,6 +1109,8 @@
 
 
 
+
+
 
 
 
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 37f73b8cd..cb2d2df50 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -277,6 +277,15 @@
   
 
 
+
+  http.proxy.type
+  HTTP
+  
+Proxy type: HTTP or SOCKS (cf. java.net.Proxy.Type).
+Note: supported by protocol-okhttp.
+  
+
+
 
   http.proxy.exception.list
   
@@ -301,9 +310,22 @@
 
 
   http.useHttp11
+  true
+  
+If true, use HTTP 1.1, if false use HTTP 1.0 .
+  
+
+
+
+  http.useHttp2
   false
-  NOTE: at the moment this works only for protocol-httpclient.
-  If true, use HTTP 1.1, if false use HTTP 1.0 .
+  
+If true try HTTP/2 and fall-back to HTTP/1.1 if HTTP/2 not
+supported, if false use always HTTP/1.1.
+
+NOTE: HTTP/2 is currently only supported by protocol-okhttp and
+requires at runtime Java 9 or a modified Java 8 with support for
+ALPN (Application Layer Protocol Negotiation).
   
 
 
diff --git a/src/java/org/apache/nutch/metadata/HttpHeaders.java 
b/src/java/org/apache/nutch/metadata/HttpHeaders.java
index 71a66f66c..b7700e5d3 100644
--- a/src/java/org/apache/nutch/metadata/HttpHeaders.java
+++ b/src/java/org/apache/nutch/metadata/HttpHeaders.java
@@ -28,6 +28,8 @@
 
   public static final String TRANSFER_ENCODING = "Transfer-Encoding";
 
+  public static final String CLIENT_TRANSFER_ENCODING = 
"Client-Transfer-Encoding";
+
   public static final String CONTENT_ENCODING = "Content-Encoding";
 
   public static final String CONTENT_LANGUAGE = "Content-Language";
@@ -48,4 +50,8 @@
 
   public static final String LOCATION = "Location";
 
+  public static final String IF_MODIFIED_SINCE = "If-Modified-Since";
+
+  public static final String USER_AGENT = "User-Agent";
+
 }
diff --git a/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java 
b/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
index 9434cab60..fdbf1b62c 100644
--- a/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
+++ b/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
@@ -32,9 +32,10 @@
 public class SpellCheckedMetadata extends Metadata {
 
   /**
-   * Treshold divider.
+   * Threshold divider to calculate max. Levenshtein distance for misspelled
+   * header field names:
* 
-   * threshold = searched.length() / TRESHOLD_DIVIDER;
+   * threshold = Math.min(3, searched.length() / 
TRESHOLD_DIVIDER);
*/
   private static final int TRESHOLD_DIVIDER = 3;
 
@@ -112,7 +113,7 @@ public static String getNormalizedName(final String name) {
 String value = NAMES_IDX.get(searched);
 
 if ((value == null) && (normalized != null)) {
-  int threshold = searched.length() / TRESHOLD_DIVIDER;
+  int threshold = Math.min(3, searched.length() / TRESHOLD_DIVIDER);
   for (int i = 0; i < normalized.length && value == null; i++) {
 if (StringUtils.getLevenshteinDistance(searched, normalized[i]) < 
threshold) {
   value = NAMES_IDX.get(normalized[i]);
diff --git a/src/java/org/apache/nutch/net/protocols/Response.java 
b/src/java/org/apache/nutch/net/protocols/Response.java
index c9139bd6c..7096c934d 100644
--- a/src/java/org/apache/nutch/net/protocols/Response.java
+++ b/src/java/org/apache/nutch/net/protocols/Response.java
@@ -26,6 +26,32 @@
  */
 public interface Response extends HttpHeaders {
 
+  /** Key to hold the HTTP request if store.http.request is true 
*/
+  public static final String REQUEST = "_request_";
+
+  /**
+   * Key to hold the HTTP response header if store.http.headers is
+   * true
+   */
+  public static final String RESPONSE_HEADERS = "_response.headers_";
+
+  /**
+   * Key to hold the IP address the request is sent to if
+   * store.ip.address is true
+   */
+  public static final String IP_ADDRESS = "_ip_";
+
+  /**
+   * Key to hold the time when the page has been fetched
+   */
+  public static final String FETCH_TIME = "nutch.fetch.time";
+
+  /**
+   * Key to hold boolean whether content has been trimmed because it 

[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-12 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509801#comment-16509801
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

Maybe it would be sufficient to test only on STATUS_DB_UNFETCHED in 
calculateLastFetchTime(datum), but fall back on CrawlDatum.getFetchTime() in the 
merger and pick the newest according to that.

That way we could also just pick the retries value from the newest one and keep 
it simple.

I'll add a PR later for review.
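The heuristic proposed above could be sketched roughly like this. This is a hypothetical illustration, not the actual PR or Nutch's CrawlDatum class: an unfetched datum contributes no real last-fetch time (treated as 0 here), so a fetched record wins, and ties fall back to comparing the scheduled fetch times directly.

```java
// Hypothetical sketch of the proposed merge heuristic (not Nutch code).
public class MergeSketch {

    static final byte STATUS_DB_UNFETCHED = 1;
    static final byte STATUS_DB_FETCHED = 2;

    static class Datum {
        byte status;
        long fetchTime;
        Datum(byte status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    // Only a fetched datum contributes a real last-fetch time.
    static long lastFetchTime(Datum d) {
        return d.status == STATUS_DB_UNFETCHED ? 0L : d.fetchTime;
    }

    static Datum pickNewest(Datum a, Datum b) {
        long la = lastFetchTime(a), lb = lastFetchTime(b);
        if (la != lb) {
            return la > lb ? a : b;
        }
        // Fallback: compare the (scheduled) fetch times directly.
        return a.fetchTime >= b.fetchTime ? a : b;
    }

    public static void main(String[] args) {
        Datum fetched = new Datum(STATUS_DB_FETCHED, 1000L);
        Datum unfetched = new Datum(STATUS_DB_UNFETCHED, 2000L);
        System.out.println(pickNewest(fetched, unfetched) == fetched); // true
    }
}
```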

> MergeDB incorrectly handles unfetched CrawlDatums
> -
>
> Key: NUTCH-2565
> URL: https://issues.apache.org/jira/browse/NUTCH-2565
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Jurian Broertjes
>Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on the output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite the fetchTime or other fields.
> I assume this is a bug and have a simple fix for it that checks whether the 
> CrawlDatum has status db_unfetched.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2595) Upgrade crawler-commons dependency to 0.10

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509785#comment-16509785
 ] 

Hudson commented on NUTCH-2595:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3533 (See 
[https://builds.apache.org/job/Nutch-trunk/3533/])
NUTCH-2595 Upgrade crawler-commons dependency to 0.10 (snagel: 
[https://github.com/apache/nutch/commit/49fa75c99fe009a9a0d89d663af1c6c70d83e06e])
* (edit) ivy/ivy.xml


> Upgrade crawler-commons dependency to 0.10
> --
>
> Key: NUTCH-2595
> URL: https://issues.apache.org/jira/browse/NUTCH-2595
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> See 
> [CHANGES|https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.10/CHANGES.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-06-12 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509786#comment-16509786
 ] 

Hudson commented on NUTCH-2576:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3533 (See 
[https://builds.apache.org/job/Nutch-trunk/3533/])
NUTCH-2576 HTTP protocol implementation based on okhttp - derived from (snagel: 
[https://github.com/apache/nutch/commit/32860a5906834dce0408ae471a376375d40e3653])
* (edit) src/java/org/apache/nutch/net/protocols/Response.java
* (add) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttp.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* (add) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/package-info.java
* (add) src/plugin/protocol-okhttp/plugin.xml
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/metadata/HttpHeaders.java
* (add) src/plugin/protocol-okhttp/jsp/brokenpage.jsp
* (add) src/plugin/protocol-okhttp/ivy.xml
* (add) src/plugin/protocol-okhttp/src/test/conf/nutch-site-test.xml
* (add) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java
* (edit) build.xml
* (add) src/plugin/protocol-okhttp/jsp/basic-http.jsp
* (add) src/plugin/protocol-okhttp/jsp/redirect302.jsp
* (edit) 
src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HttpResponse.java
* (edit) src/plugin/build.xml
* (add) src/plugin/protocol-okhttp/build.xml
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* (add) 
src/plugin/protocol-okhttp/src/test/org/apache/nutch/protocol/okhttp/TestProtocolOkHttp.java
* (add) src/plugin/protocol-okhttp/jsp/redirect301.jsp
NUTCH-2576 HTTP protocol implementation based on okhttp - fix: copy (snagel: 
[https://github.com/apache/nutch/commit/659e1c8900c65d96b0a498d602992cc11c9430d6])
* (edit) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java
NUTCH-2576 HTTP protocol implementation based on okhttp - do not catch (snagel: 
[https://github.com/apache/nutch/commit/f598db71c22c68aa8dc00028609bef9ab6b94158])
* (edit) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java
NUTCH-2576 HTTP protocol implementation based on okhttp - set Cookie (snagel: 
[https://github.com/apache/nutch/commit/dbdb40bcfacf8d53076bc24f768c5dde0832f742])
* (edit) 
src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java
NUTCH-2576 HTTP protocol implementation based on okhttp - port unit (snagel: 
[https://github.com/apache/nutch/commit/466a0ed4b5398d7624931ad5115c4b50624dfd12])
* (edit) src/plugin/protocol-okhttp/src/test/conf/nutch-site-test.xml
* (add) 
src/plugin/protocol-okhttp/src/test/org/apache/nutch/protocol/okhttp/TestBadServerResponses.java
NUTCH-2576 HTTP protocol implementation based on okhttp - change port (snagel: 
[https://github.com/apache/nutch/commit/f1aa728b113a716db783690e220c9e03318a62f1])
* (edit) 
src/plugin/protocol-okhttp/src/test/org/apache/nutch/protocol/okhttp/TestBadServerResponses.java


> HTTP protocol plugin based on okhttp
> 
>
> Key: NUTCH-2576
> URL: https://issues.apache.org/jira/browse/NUTCH-2576
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library 
> which supports HTTP/2. [~jnioche]'s implementation 
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443] 
> proves that it should be straightforward to implement a Nutch protocol plugin 
> using okhttp. A recent HTTP protocol implementation should also fix (most of) 
> the issues reported in NUTCH-2549.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker

2018-06-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509779#comment-16509779
 ] 

ASF GitHub Bot commented on NUTCH-2012:
---

sju opened a new pull request #348: NUTCH-2012: output fix
URL: https://github.com/apache/nutch/pull/348
 
 
   Thanks for your contribution to [Apache Nutch](http://nutch.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Nutch issue 
tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`NUTCH-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages
   * commits are squashed into a single one (or few commits for larger changes)
   * Java source code follows [Nutch Eclipse Code Formatting 
rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
   * Nutch is successfully built and unit tests pass by running `ant clean 
runtime test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* master branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled master branch.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Nutch 
in general, please sign up for the [Nutch mailing 
list](http://nutch.apache.org/mailing_lists.html). Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Merge parsechecker and indexchecker
> ---
>
> Key: NUTCH-2012
> URL: https://issues.apache.org/jira/browse/NUTCH-2012
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> ParserChecker and IndexingFiltersChecker have evolved from simple tools to 
> check parsers and parsefilters resp. indexing filters to powerful tools which 
> emulate the crawling of a single URL/document:
> - check robots.txt (NUTCH-2002)
> - follow redirects (NUTCH-2004)
> Keeping both tools in sync takes extra work (cf. NUTCH-1757/NUTCH-2006, also 
> NUTCH-2002, NUTCH-2004 are done only for parsechecker). It's time to merge 
> them
> * either into one general debugging tool, keeping parsechecker and 
> indexchecker as aliases
> * centralize common code in one utility class



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2040) Upgrade to recent version of Crawler-Commons

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2040.

Resolution: Implemented

> Upgrade to recent version of Crawler-Commons
> 
>
> Key: NUTCH-2040
> URL: https://issues.apache.org/jira/browse/NUTCH-2040
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.4
>
>
> [Crawler Commons 
> 0.6|https://github.com/crawler-commons/crawler-commons#11th-june-2015---crawler-commons-06-is-released]
>  was released. We should upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2040) Upgrade to recent version of Crawler-Commons

2018-06-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509768#comment-16509768
 ] 

ASF GitHub Bot commented on NUTCH-2040:
---

sebastian-nagel closed pull request #346: NUTCH-2040 Upgrade to recent version 
of Crawler-Commons
URL: https://github.com/apache/nutch/pull/346
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 62e151e06..1b8d71494 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -74,7 +74,7 @@
 
 
 
-
+
 
 
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade to recent version of Crawler-Commons
> 
>
> Key: NUTCH-2040
> URL: https://issues.apache.org/jira/browse/NUTCH-2040
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.4
>
>
> [Crawler Commons 
> 0.6|https://github.com/crawler-commons/crawler-commons#11th-june-2015---crawler-commons-06-is-released]
>  was released. We should upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-06-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509766#comment-16509766
 ] 

ASF GitHub Bot commented on NUTCH-2576:
---

sebastian-nagel closed pull request #328: NUTCH-2576 HTTP protocol 
implementation based on okhttp
URL: https://github.com/apache/nutch/pull/328
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/build.xml b/build.xml
index 1d680d0bd..d4836a4f2 100644
--- a/build.xml
+++ b/build.xml
@@ -215,6 +215,7 @@
   
   
   
+  
   
   
   
@@ -673,6 +674,7 @@
   
   
   
+  
   
   
   
@@ -1107,6 +1109,8 @@
 
 
 
+
+
 
 
 
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 37f73b8cd..fcedc6df0 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -277,6 +277,15 @@
   
 
 
+
+  http.proxy.type
+  HTTP
+  
+Proxy type: HTTP or SOCKS (cf. java.net.Proxy.Type).
+Note: supported by protocol-okhttp.
+  
+
+
 
   http.proxy.exception.list
   
@@ -307,6 +316,19 @@
   
 
 
+
+  http.useHttp2
+  false
+  
+If true try HTTP/2 and fall-back to HTTP/1.1 if HTTP/2 not
+supported, if false use always HTTP/1.1.
+
+NOTE: HTTP/2 is currently only supported by protocol-okhttp and
+requires at runtime Java 9 or a modified Java 8 with support for
+ALPN (Application Layer Protocol Negotiation).
+  
+
+
 
   http.accept.language
   en-us,en-gb,en;q=0.7,*;q=0.3
diff --git a/src/java/org/apache/nutch/metadata/HttpHeaders.java 
b/src/java/org/apache/nutch/metadata/HttpHeaders.java
index 71a66f66c..a3aec1dbb 100644
--- a/src/java/org/apache/nutch/metadata/HttpHeaders.java
+++ b/src/java/org/apache/nutch/metadata/HttpHeaders.java
@@ -48,4 +48,8 @@
 
   public static final String LOCATION = "Location";
 
+  public static final String IF_MODIFIED_SINCE = "If-Modified-Since";
+
+  public static final String USER_AGENT = "User-Agent";
+
 }
diff --git a/src/java/org/apache/nutch/net/protocols/Response.java 
b/src/java/org/apache/nutch/net/protocols/Response.java
index c9139bd6c..7096c934d 100644
--- a/src/java/org/apache/nutch/net/protocols/Response.java
+++ b/src/java/org/apache/nutch/net/protocols/Response.java
@@ -26,6 +26,32 @@
  */
 public interface Response extends HttpHeaders {
 
+  /** Key to hold the HTTP request if store.http.request is true 
*/
+  public static final String REQUEST = "_request_";
+
+  /**
+   * Key to hold the HTTP response header if store.http.headers is
+   * true
+   */
+  public static final String RESPONSE_HEADERS = "_response.headers_";
+
+  /**
+   * Key to hold the IP address the request is sent to if
+   * store.ip.address is true
+   */
+  public static final String IP_ADDRESS = "_ip_";
+
+  /**
+   * Key to hold the time when the page has been fetched
+   */
+  public static final String FETCH_TIME = "nutch.fetch.time";
+
+  /**
+   * Key to hold boolean whether content has been trimmed because it exceeds
+   * http.content.limit
+   */
+  public static final String TRIMMED_CONTENT = "http.content.trimmed";
+
   /** Returns the URL used to retrieve this response. */
   public URL getUrl();
 
diff --git a/src/plugin/build.xml b/src/plugin/build.xml
index 5a3a8c910..a9cb912cc 100755
--- a/src/plugin/build.xml
+++ b/src/plugin/build.xml
@@ -73,6 +73,7 @@
 
 
 
+
 
 
 
@@ -132,6 +133,7 @@
  
  
  
+ 
  
  
  
@@ -209,6 +211,7 @@
 
 
 
+
 
 
 
diff --git 
a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java 
b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
index d9284c9aa..1cb2bb151 100644
--- 
a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
+++ 
b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
@@ -20,6 +20,8 @@
 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.Reader;
+import java.net.Proxy;
+import java.net.URI;
 import java.net.URL;
 import java.util.ArrayList;
 import java.util.Arrays;
@@ -70,8 +72,11 @@
   /** The proxy port. */
   protected int proxyPort = 8080;
   
+  /** The proxy port. */
+  protected Proxy.Type proxyType = Proxy.Type.HTTP;
+
   /** The proxy exception list. */
-  protected HashMap proxyException = new HashMap(); 
+  protected HashMap proxyException = new HashMap<>();
 
   /** Indicates if a proxy is used */
   protected boolean useProxy = false;
@@ -89,7 +94,7 @@
   /** The "Accept-Language" request header value. */
   protected String acceptLanguage = "en-us,en-gb,en;q=0.7,*;q=0.3";
 
-  /** The "Accept-Language" 

[jira] [Resolved] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2576.

Resolution: Implemented

> HTTP protocol plugin based on okhttp
> 
>
> Key: NUTCH-2576
> URL: https://issues.apache.org/jira/browse/NUTCH-2576
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library 
> which supports HTTP/2. [~jnioche]'s implementation 
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443] 
> proves that it should be straightforward to implement a Nutch protocol plugin 
> using okhttp. A recent HTTP protocol implementation should also fix (most of) 
> the issues reported in NUTCH-2549.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2576:
--

Assignee: Sebastian Nagel

> HTTP protocol plugin based on okhttp
> 
>
> Key: NUTCH-2576
> URL: https://issues.apache.org/jira/browse/NUTCH-2576
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library 
> which supports HTTP/2. [~jnioche]'s implementation 
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443] 
> proves that it should be straightforward to implement a Nutch protocol plugin 
> using okhttp. A recent HTTP protocol implementation should also fix (most of) 
> the issues reported in NUTCH-2549.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work started] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2576 started by Sebastian Nagel.
--
> HTTP protocol plugin based on okhttp
> 
>
> Key: NUTCH-2576
> URL: https://issues.apache.org/jira/browse/NUTCH-2576
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library 
> which supports HTTP/2. [~jnioche]'s implementation 
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443] 
> proves that it should be straightforward to implement a Nutch protocol plugin 
> using okhttp. A recent HTTP protocol implementation should also fix (most of) 
> the issues reported in NUTCH-2549.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (NUTCH-2595) Upgrade crawler-commons dependency to 0.10

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2595:
--

Assignee: Sebastian Nagel

> Upgrade crawler-commons dependency to 0.10
> --
>
> Key: NUTCH-2595
> URL: https://issues.apache.org/jira/browse/NUTCH-2595
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> See 
> [CHANGES|https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.10/CHANGES.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2595) Upgrade crawler-commons dependency to 0.10

2018-06-12 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2595.

Resolution: Implemented

> Upgrade crawler-commons dependency to 0.10
> --
>
> Key: NUTCH-2595
> URL: https://issues.apache.org/jira/browse/NUTCH-2595
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> See 
> [CHANGES|https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.10/CHANGES.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2595) Upgrade crawler-commons dependency to 0.10

2018-06-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509763#comment-16509763
 ] 

ASF GitHub Bot commented on NUTCH-2595:
---

sebastian-nagel closed pull request #345: NUTCH-2595 Upgrade crawler-commons 
dependency to 0.10
URL: https://github.com/apache/nutch/pull/345
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index eb29c9ddb..9b8d667b8 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -74,9 +74,7 @@
 

 
-   
-   
-   
+   
 




 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade crawler-commons dependency to 0.10
> --
>
> Key: NUTCH-2595
> URL: https://issues.apache.org/jira/browse/NUTCH-2595
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> See 
> [CHANGES|https://github.com/crawler-commons/crawler-commons/blob/crawler-commons-0.10/CHANGES.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker

2018-06-12 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509686#comment-16509686
 ] 

Jurian Broertjes commented on NUTCH-2012:
-

It looks like the process() function still uses System.out.println for output, 
instead of the output StringBuilder. I can supply a small PR to fix it.

> Merge parsechecker and indexchecker
> ---
>
> Key: NUTCH-2012
> URL: https://issues.apache.org/jira/browse/NUTCH-2012
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> ParserChecker and IndexingFiltersChecker have evolved from simple tools to 
> check parsers and parsefilters resp. indexing filters to powerful tools which 
> emulate the crawling of a single URL/document:
> - check robots.txt (NUTCH-2002)
> - follow redirects (NUTCH-2004)
> Keeping both tools in sync takes extra work (cf. NUTCH-1757/NUTCH-2006, also 
> NUTCH-2002, NUTCH-2004 are done only for parsechecker). It's time to merge 
> them
> * either into one general debugging tool, keeping parsechecker and 
> indexchecker as aliases
> * centralize common code in one utility class



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Nutch 1.14 issues

2018-06-12 Thread Sebastian Nagel
Hi Arkadi,

thanks for your feedback and suggestions.
I can understand your frustration but I also want to clarify:

- Arch is a nice project, for sure. But Arch is GPL-licensed,
  which makes contributions a one-way route (Nutch -> Arch)
  and even keeps me from looking into the Arch sources. Sorry.

- Please take the time to split your list of issues into separate
  requests on the mailing list or open separate Jira issues.
  Also take care that the problems are reproducible by sharing
  documents that failed to parse, log snippets, config files, etc.

- Sorry about NUTCH-2071, I took this mainly as a class path issue
  in the parse-tika plugin (which is solved). Now I understand better
  what your objective is, and I'll review and try to fix it
  (in combination with NUTCH-1993). But again: please take the time
  to explain your objectives, ping committers if fixes make no progress,
  etc.

- Nutch is a community project. There are no "paid" committers. This
  means although some of us are paid to configure/operate/adapt crawlers
  nobody is delegated to fix issues, support Nutch users, etc.
  That's voluntary work.

- Everybody is welcome to contribute (patches, documentation, support
  on the mailing list, etc.). Because Nutch is a small project, this
  will definitely help us.


Thanks,
Sebastian



On 06/12/2018 08:46 AM, arkadi.kosmy...@csiro.au wrote:
> Hi guys,
> 
>  
> 
> I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to 
> Nutch 1.14 and Solr 7.2,
> and I have come across a few serious issues, of which you should be aware:
> 
>  
> 
> 1.   NUTCH-2071 is still an issue in 1.14, because the returned parseResult 
> is never null. If a parser fails to parse a document, it returns an empty 
> result, not null. This means that, from a chain of parser candidates, only 
> the first one has a chance to try to parse the document.
> 
> 2.   Nutch adopted Tika as a general parsing tool and stopped supporting the 
> "legacy" parsing (OO, MS) plugins. I continued using them and hoped to drop 
> them in the next version of Arch, which I am preparing for release, but I 
> still can't, because Tika fails to parse too many documents on our site. 
> When I reinforce Tika with the legacy parsers, I achieve an almost 100% 
> parsing success rate. This is why NUTCH-2071 is important for Arch. I think 
> you should bring the legacy parsers back to Nutch, because the quality of 
> parsing of "real life" data, such as ours, is not great without them.
> 
> 3.   The lines defining the fall-back (*) plugin in parse-plugins.xml are 
> not effective: they are ignored as long as at least one plugin claims * in 
> its plugin.xml file. In some cases, Nutch assigns the * capability to 
> plugins that don't even claim it. For example, I can't understand why the 
> Arch content-blocking plugin gets it.
> 
> 4.   In earlier versions of Nutch, use of the native libraries really 
> helped: it reduced crawling of our site from a couple of days to 6-7 hours. 
> In Nutch 1.14, I don't notice this. I've obtained the Hadoop native 
> libraries, placed them where they are expected, and even inserted an 
> explicit load-library call in my code, but I still don't see any significant 
> time savings.
> 
> 5.   The Feed plugin seems to have a major problem. Line 102 in 
> FeedIndexingFilter.java generated a NumberFormatException (which caused the 
> failure of the entire crawling process!) because it was trying to parse a 
> date in string format, not a number. Given that this metadata piece was 
> generated by the feed parser (same plugin), it seems that the plugin is in 
> disagreement with itself.
> 
> 6.   This is less important, but when Tika fails to parse a document, it 
> generates a scary error message and an ugly stack trace. I think this should 
> be a one-line warning, because other parsers may still parse the document 
> successfully.
> 
>  
> 
> Hope this helps.
> 
>  
> 
> Regards,
> 
>  
> 
> Arkadi
> 



[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-12 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509596#comment-16509596
 ] 

Sebastian Nagel commented on NUTCH-2565:


I thought first about making the condition in calculateLastFetchTime(datum) 
more strict:
{code}
if (datum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED
    && datum.getRetriesSinceFetch() == 0) {
  return 0L;
}
{code}
This will guarantee that we do not prefer an older DB_FETCHED over a newer 
DB_UNFETCHED with a "transient" failure. If there are two DB_UNFETCHED with 
retries > 0 to be merged, it's important that
# the fetch time is the latest (for scheduling)
# yes, we could sum the retry counts, but then we also need to trigger a status 
change if retries > db.fetch.retry.max. We also need to make sure not to cause 
a retry counter overflow (it's only a signed byte) if many CrawlDbs are merged. 
In short, this looks too complex to me. What do you think?
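The signed-byte concern above can be shown in isolation. This is a plain-Java
sketch, not Nutch code; it only assumes that the retry counter is stored in a
signed byte, as stated in the comment:

```java
public class RetryOverflow {
    // Naively summing retry counters from many merged CrawlDbs can silently
    // wrap past Byte.MAX_VALUE (127) into a negative value.
    static byte sumRetries(byte a, byte b) {
        return (byte) (a + b); // arithmetic happens in int, then truncates
    }

    public static void main(String[] args) {
        // 100 + 100 = 200, which wraps to 200 - 256 = -56 as a signed byte
        System.out.println(RetryOverflow.sumRetries((byte) 100, (byte) 100));
    }
}
```

A merge that sums retries would therefore also need an overflow guard (e.g.
clamping at db.fetch.retry.max) on top of the status-change logic.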

> MergeDB incorrectly handles unfetched CrawlDatums
> -
>
> Key: NUTCH-2565
> URL: https://issues.apache.org/jira/browse/NUTCH-2565
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Jurian Broertjes
>Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite fetchTime or other stuff.
> I assume this is a bug and have a simple fix for it that checks if CrawlDatum 
> has status db_unfetched.
>  





[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-12 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509573#comment-16509573
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

One solution would be to sum the retries of both CrawlDatums. We could do this 
only for db_unfetched or for others as well. What do you think would be 
appropriate?

 

 

> MergeDB incorrectly handles unfetched CrawlDatums
> -
>
> Key: NUTCH-2565
> URL: https://issues.apache.org/jira/browse/NUTCH-2565
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Jurian Broertjes
>Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite fetchTime or other stuff.
> I assume this is a bug and have a simple fix for it that checks if CrawlDatum 
> has status db_unfetched.
>  





[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-06-12 Thread Omkar Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509469#comment-16509469
 ] 

Omkar Reddy commented on NUTCH-2557:


A simple and wise solution. Thanks. 

> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> 
>
> Key: NUTCH-2557
> URL: https://issues.apache.org/jira/browse/NUTCH-2557
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket as 
> soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP 
> body containing invalidly gzip encoded contents. Browsers follow the 
> redirection, but nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it 
> should at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try 
> parsing the body when the headers indicate a redirection.
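The proposed behaviour can be sketched as a small predicate. This is an
illustration, not the protocol-http code; the method name and parameters are
hypothetical:

```java
public class RedirectCheck {
    // When the status code is a redirection (3xx) and a Location header is
    // present, skip body parsing entirely instead of failing on an
    // undecodable body -- matching what browsers do.
    static boolean shouldSkipBody(int status, String locationHeader) {
        return status >= 300 && status < 400 && locationHeader != null;
    }

    public static void main(String[] args) {
        System.out.println(RedirectCheck.shouldSkipBody(301, "https://example.org/"));
        System.out.println(RedirectCheck.shouldSkipBody(200, null));
    }
}
```

With such a check in place, the invalid gzip body in the example URL would
never be decoded, and the redirect would be followed.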





Nutch 1.14 issues

2018-06-12 Thread Arkadi.Kosmynin
Hi guys,

I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to Nutch 
1.14 and Solr 7.2, and I have come across a few serious issues, of which you 
should be aware:


1.   NUTCH-2071 is still an issue in 1.14, because the returned parseResult is 
never null. If a parser fails to parse a document, it returns an empty result, 
not null. This means that, from a chain of parser candidates, only the first 
one has a chance to try to parse the document.
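The fall-through behaviour being asked for can be sketched with a minimal,
hypothetical parser interface (Nutch's real Parser/ParseResult API differs):

```java
import java.util.List;

public class ParserChain {
    /** Hypothetical minimal parser; null signals "could not parse". */
    interface Parser {
        String parse(byte[] content);
    }

    // Only a null result lets the next candidate run. Per the report, Nutch
    // today returns an empty (non-null) result on failure, so the chain
    // stops at the first parser.
    static String parseWithFallback(List<Parser> candidates, byte[] content) {
        for (Parser p : candidates) {
            String result = p.parse(content);
            if (result != null) {
                return result;
            }
        }
        return null; // every candidate failed
    }

    public static void main(String[] args) {
        Parser tikaLike = c -> null;             // simulated parse failure
        Parser legacy   = c -> "parsed-by-legacy";
        // The failure falls through to the legacy parser
        System.out.println(parseWithFallback(List.of(tikaLike, legacy), new byte[0]));
    }
}
```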

2.   Nutch adopted Tika as a general parsing tool and stopped supporting the 
"legacy" parsing (OO, MS) plugins. I continued using them and hoped to drop 
them in the next version of Arch, which I am preparing for release, but I 
still can't, because Tika fails to parse too many documents on our site. When 
I reinforce Tika with the legacy parsers, I achieve an almost 100% parsing 
success rate. This is why NUTCH-2071 is important for Arch. I think you should 
bring the legacy parsers back to Nutch, because the quality of parsing of 
"real life" data, such as ours, is not great without them.

3.   The lines defining the fall-back (*) plugin in parse-plugins.xml are not 
effective: they are ignored as long as at least one plugin claims * in its 
plugin.xml file. In some cases, Nutch assigns the * capability to plugins that 
don't even claim it. For example, I can't understand why the Arch 
content-blocking plugin gets it.
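For context, the fall-back mapping referred to lives in conf/parse-plugins.xml 
and looks roughly like this (the plugin ids and MIME type here are 
illustrative, not a verbatim copy of the shipped file):

```xml
<parse-plugins>
  <!-- explicit mapping for a specific MIME type -->
  <mimeType name="application/rss+xml">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- fall-back for anything not matched above; per the report, this entry
       is ignored whenever some plugin also claims "*" in its own plugin.xml -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
```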

4.   In earlier versions of Nutch, use of the native libraries really helped: 
it reduced crawling of our site from a couple of days to 6-7 hours. In Nutch 
1.14, I don't notice this. I've obtained the Hadoop native libraries, placed 
them where they are expected, and even inserted an explicit load-library call 
in my code, but I still don't see any significant time savings.

5.   The Feed plugin seems to have a major problem. Line 102 in 
FeedIndexingFilter.java generated a NumberFormatException (which caused the 
failure of the entire crawling process!) because it was trying to parse a date 
in string format, not a number. Given that this metadata piece was generated 
by the feed parser (same plugin), it seems that the plugin is in disagreement 
with itself.
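A defensive reader would avoid that crash. This is a sketch under the 
assumption that the metadata value is either a millisecond timestamp or an 
ISO-8601 date string; it is not the actual FeedIndexingFilter code, and the 
method name is hypothetical:

```java
public class FeedDateParse {
    // Accept both forms the feed plugin might store: a raw millisecond
    // timestamp, or a formatted date string (which today triggers the
    // NumberFormatException described above).
    static long toEpochMillis(String value) {
        try {
            return Long.parseLong(value); // already a millisecond timestamp
        } catch (NumberFormatException e) {
            // fall back to an ISO-8601 instant instead of crashing the crawl
            return java.time.Instant.parse(value).toEpochMilli();
        }
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("1528761600000"));
        System.out.println(toEpochMillis("2018-06-12T00:00:00Z"));
    }
}
```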

6.   This is less important, but when Tika fails to parse a document, it 
generates a scary error message and an ugly stack trace. I think this should 
be a one-line warning, because other parsers may still parse the document 
successfully.

Hope this helps.

Regards,

Arkadi