[jira] [Commented] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836191#comment-17836191
 ] 

Tim Allison commented on NUTCH-3040:


:cry-sob: This is great news!

> Upgrade to Hadoop 3.4.0
> ---
>
> Key: NUTCH-3040
> URL: https://issues.apache.org/jira/browse/NUTCH-3040
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> [Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released.
> Many dependencies are upgraded, including commons-io 2.14.0 which would have 
> saved us a lot of work in NUTCH-2959.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834532#comment-17834532
 ] 

Tim Allison commented on NUTCH-2937:


I really, really, really wish we didn't have to do this! :P

Happy to help!

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a 
> conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
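For context on the failure above: `CloseShieldInputStream.wrap(InputStream)` is a static factory that does not exist in commons-io 2.5, so Tika code compiled against a newer commons-io throws `NoSuchMethodError` when Hadoop's provided 2.5 jar wins on the classpath. One hedged sketch of a build-side fix, assuming Ivy dependency mediation applies here (the `override` element is real Ivy syntax, but the revision and placement are illustrative, not taken from Nutch's actual ivy.xml):

```xml
<!-- Hypothetical Ivy mediation: force a single commons-io revision so the
     version Tika was compiled against wins over Hadoop's older 2.5 jar.
     The revision shown is illustrative only. -->
<dependencies>
  <override org="commons-io" module="commons-io" rev="2.11.0"/>
  <!-- ... existing dependency declarations ... -->
</dependencies>
```

In distributed mode the mediation alone may not be enough, since Hadoop's own classpath is loaded first; that is why the ticket also reviews per-plugin dependency exclusions.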





[jira] [Comment Edited] (NUTCH-3026) Allow statusOnly option for IndexingJob

2024-03-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827510#comment-17827510
 ] 

Tim Allison edited comment on NUTCH-3026 at 3/15/24 2:18 PM:
-

Lost support for working on this issue. If anyone else wants to take it or 
finds a need, please re-open.


was (Author: talli...@mitre.org):
Lost support for working on this issue.

> Allow statusOnly option for IndexingJob
> ---
>
> Key: NUTCH-3026
> URL: https://issues.apache.org/jira/browse/NUTCH-3026
> Project: Nutch
>  Issue Type: New Feature
>        Reporter: Tim Allison
>Priority: Major
>
> This issue follows on from discussion here: 
> https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy
> I'd like to be able to run aggregations and other analytics on the current 
> status of a given crawl outside of Hadoop.
> There are different ways of going about this, and the title of this ticket 
> leads with my preference, but I'm opening this ticket for discussion.
> The goal would be to have an index with information per url on fetch status, 
> http status, parse status, possibly user selected parse metadata when it 
> exists.
> I want to be able to count 404s and other fetch issues (by host). I want to 
> be able to count parse exceptions, file types (by host), etc.
> I do not want to pollute my search index with content-less documents for 
> 404s/parse exceptions etc. I want two indices: one for crawl status and one 
> for search.
> Here are some options I see:
> Option 1: add a "statusOnly" option to the IndexingJob. This would 
> intentionally skip a bunch of the current logic that says "only send to the 
> index if there was a fetch success and there was a parse success and it isn't 
> a duplicate and ...". My proposal would not delete statuses in this index, 
> rather, the working assumption at least to start is that you'd run this on an 
> empty index to get a snapshot of the latest crawl data. We can look into 
> changing this in the future, but not on this ticket.
> Option 2: Copy/paste IndexingJob and then modify it and call it a whole other 
> tool
> Option 3: modify readdb or readseg to do roughly this, but it feels like each 
> one doesn't touch enough of the data components.
> Option 4: I can do effectively option 2 in a personal repo and not add more 
> code to Nutch.
> Other options?
> And, importantly, is there anyone else who would use this? Or is this really 
> only something that I'd want?
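As a sketch of the analytics such a status-only index would enable, here is a hedged Elasticsearch terms aggregation counting failed fetches by host. The field names (`fetch_status`, `host`) are hypothetical, not an existing Nutch index schema:

```json
{
  "size": 0,
  "query": { "term": { "fetch_status": "gone" } },
  "aggs": {
    "by_host": { "terms": { "field": "host", "size": 100 } }
  }
}
```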





[jira] [Resolved] (NUTCH-3026) Allow statusOnly option for IndexingJob

2024-03-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3026.

Resolution: Won't Fix

Lost support for working on this issue.






[jira] [Commented] (NUTCH-3026) Allow statusOnly option for IndexingJob

2024-03-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825440#comment-17825440
 ] 

Tim Allison commented on NUTCH-3026:


I should close out the PR and this issue. With change in employment, I won't 
have time to work on this. I'll leave it open for a few more days in case 
someone wants to pick it up.






[jira] [Commented] (NUTCH-3026) Allow statusOnly option for IndexingJob

2023-12-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794972#comment-17794972
 ] 

Tim Allison commented on NUTCH-3026:


Anyone have any time for feedback, even if only at a high level? Thank you!






[jira] [Commented] (NUTCH-3026) Allow statusOnly option for IndexingJob

2023-11-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787372#comment-17787372
 ] 

Tim Allison commented on NUTCH-3026:


The above PR is a WIP for discussion. Let me know what you think.






[jira] [Updated] (NUTCH-3026) Allow statusOnly option for IndexingJob

2023-11-17 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3026:
---
Description: 
This issue follows on from discussion here: 
https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy

I'd like to be able to run aggregations and other analytics on the current 
status of a given crawl outside of Hadoop.

There are different ways of going about this, and the title of this ticket 
leads with my preference, but I'm opening this ticket for discussion.

The goal would be to have an index with information per url on fetch status, 
http status, parse status, possibly user selected parse metadata when it exists.

I want to be able to count 404s and other fetch issues (by host). I want to be 
able to count parse exceptions, file types (by host), etc.

I do not want to pollute my search index with content-less documents for 
404s/parse exceptions etc. I want two indices: one for crawl status and one for 
search.

Here are some options I see:

Option 1: add a "statusOnly" option to the IndexingJob. This would 
intentionally skip a bunch of the current logic that says "only send to the 
index if there was a fetch success and there was a parse success and it isn't a 
duplicate and ...". My proposal would not delete statuses in this index, 
rather, the working assumption at least to start is that you'd run this on an 
empty index to get a snapshot of the latest crawl data. We can look into 
changing this in the future, but not on this ticket.

Option 2: Copy/paste IndexingJob and then modify it and call it a whole other 
tool

Option 3: modify readdb or readseg to do roughly this, but it feels like each 
one doesn't touch enough of the data components.

Option 4: I can do effectively option 2 in a personal repo and not add more 
code to Nutch.

Other options?

And, importantly, is there anyone else who would use this? Or is this really 
only something that I'd want?

  was:
This issue follows on from discussion here: 
https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy

I'd like to be able to run aggregations and other analytics on the current 
status of a given crawl outside of Hadoop.

There are different ways of going about this, and the title of this ticket 
leads with my preference, but I'm opening this ticket for discussion.

The goal would be to have an index with information per url on fetch status, 
http status, parse status, possibly user selected parse metadata when it exists.

I want to be able to count 404s and other fetch issues (by host). I want to be 
able to count parse exceptions, file types (by host), etc.

I do not want to pollute my search index with content-less documents for 
404s/parse exceptions etc. I want two indices: one for crawl status and one for 
search.

Here are some options I see:

Option 1: add a "statusOnly" option to the IndexingJob. This would 
intentionally skip a bunch of the current logic that says "only send to the 
index if there was a fetch success and there was a parse success and it isn't a 
duplicate and ...". My proposal would not delete statuses in this index, 
rather, the working assumption at least to start is that you'd run this on an 
empty index to get a snapshot of the latest crawl data. We can look into 
changing this in the future, but not on this ticket.

Option 2: Copy/paste IndexingJob and then modify it and call it a whole other 
tool

Option 3: modify readdb or readseg to do roughly this, but it feels like each 
one doesn't touch enough of the data components.

Option 4: I can do effectively option 2 in a personal repo and not add more 
code to Nutch.

Other options?



[jira] [Updated] (NUTCH-3026) Allow statusOnly option for IndexingJob

2023-11-17 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3026:
---
Description: 
This issue follows on from discussion here: 
https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy

I'd like to be able to run aggregations and other analytics on the current 
status of a given crawl outside of Hadoop.

There are different ways of going about this, and the title of this ticket 
leads with my preference, but I'm opening this ticket for discussion.

The goal would be to have an index with information per url on fetch status, 
http status, parse status, possibly user selected parse metadata when it exists.

I want to be able to count 404s and other fetch issues (by host). I want to be 
able to count parse exceptions, file types (by host), etc.

I do not want to pollute my search index with content-less documents for 
404s/parse exceptions etc. I want two indices: one for crawl status and one for 
search.

Here are some options I see:

Option 1: add a "statusOnly" option to the IndexingJob. This would 
intentionally skip a bunch of the current logic that says "only send to the 
index if there was a fetch success and there was a parse success and it isn't a 
duplicate and ...". My proposal would not delete statuses in this index, 
rather, the working assumption at least to start is that you'd run this on an 
empty index to get a snapshot of the latest crawl data. We can look into 
changing this in the future, but not on this ticket.

Option 2: Copy/paste IndexingJob and then modify it and call it a whole other 
tool

Option 3: modify readdb or readseg to do roughly this, but it feels like each 
one doesn't touch enough of the data components.

Option 4: I can do effectively option 2 in a personal repo and not add more 
code to Nutch.

Other options?

  was:
This issue follows on from discussion here: 
https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy

I'd like to be able to run aggregations and other analytics on the current 
status of a given crawl outside of Hadoop.

There are different ways of going about this, and the title of this ticket 
leads with my preference, but I'm opening this ticket for discussion.

The goal would be to have an index with information per url on fetch status, 
http status, parse status, possibly user selected parse metadata when it exists.

I want to be able to count 404s and other fetch issues (by host). I want to be 
able to count parse exceptions, file types (by host), etc.

I do not want to pollute my search index with content-less documents for 
404s/parse exceptions etc. I want a separate index.

Here are some options I see:

Option 1: add a "statusOnly" option to the IndexingJob. This would 
intentionally skip a bunch of the current logic that says "only send to the 
index if there was a fetch success and there was a parse success and it isn't a 
duplicate and ...". My proposal would not delete statuses in this index, 
rather, the working assumption at least to start is that you'd run this on an 
empty index to get a snapshot of the latest crawl data. We can look into 
changing this in the future, but not on this ticket.

Option 2: Copy/paste IndexingJob and then modify it and call it a whole other 
tool

Option 3: modify readdb or readseg to do roughly this, but it feels like each 
one doesn't touch enough of the data components.

Option 4: I can do effectively option 2 in a personal repo and not add more 
code to Nutch.

Other options?



[jira] [Updated] (NUTCH-3026) Allow statusOnly option for IndexingJob

2023-11-17 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3026:
---
Issue Type: New Feature  (was: Task)






[jira] [Created] (NUTCH-3026) Allow statusOnly option for IndexingJob

2023-11-17 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3026:
--

 Summary: Allow statusOnly option for IndexingJob
 Key: NUTCH-3026
 URL: https://issues.apache.org/jira/browse/NUTCH-3026
 Project: Nutch
  Issue Type: Task
Reporter: Tim Allison


This issue follows on from discussion here: 
https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy

I'd like to be able to run aggregations and other analytics on the current 
status of a given crawl outside of Hadoop.

There are different ways of going about this, and the title of this ticket 
leads with my preference, but I'm opening this ticket for discussion.

The goal would be to have an index with information per url on fetch status, 
http status, parse status, possibly user selected parse metadata when it exists.

I want to be able to count 404s and other fetch issues (by host). I want to be 
able to count parse exceptions, file types (by host), etc.

I do not want to pollute my search index with content-less documents for 
404s/parse exceptions etc. I want a separate index.

Here are some options I see:

Option 1: add a "statusOnly" option to the IndexingJob. This would 
intentionally skip a bunch of the current logic that says "only send to the 
index if there was a fetch success and there was a parse success and it isn't a 
duplicate and ...". My proposal would not delete statuses in this index, 
rather, the working assumption at least to start is that you'd run this on an 
empty index to get a snapshot of the latest crawl data. We can look into 
changing this in the future, but not on this ticket.

Option 2: Copy/paste IndexingJob and then modify it and call it a whole other 
tool

Option 3: modify readdb or readseg to do roughly this, but it feels like each 
one doesn't touch enough of the data components.

Option 4: I can do effectively option 2 in a personal repo and not add more 
code to Nutch.

Other options?





[jira] [Resolved] (NUTCH-3020) ParseSegment should check for protocol's flags for truncation

2023-11-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3020.

Fix Version/s: 1.20
   Resolution: Fixed

> ParseSegment should check for protocol's flags for truncation
> -
>
> Key: NUTCH-3020
> URL: https://issues.apache.org/jira/browse/NUTCH-3020
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.20
>
>
> As discussed on the user list, several protocols can identify when a fetch 
> has been truncated. ParseSegment only checks for the number of bytes fetched 
> vs the http length header (if it exists). We should modify ParseSegment to 
> check for notification of truncation from the protocols.
> I noticed this specifically with okhttp, but other protocols may flag 
> truncation as well.
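The proposed check can be sketched as follows. This is a hedged, self-contained illustration, not the actual Nutch ParseSegment code; in particular the metadata key `http.content.truncated` is a hypothetical name for whatever flag the protocol (e.g. okhttp) actually sets:

```java
import java.util.HashMap;
import java.util.Map;

public class TruncationCheck {
    // Hypothetical metadata key: the real flag name set by protocol-okhttp may differ.
    static final String TRUNCATED_KEY = "http.content.truncated";

    // Sketch of the proposed ParseSegment decision: skip parsing truncated content.
    static boolean shouldSkipParse(Map<String, String> protocolMetadata,
                                   long bytesFetched, long contentLengthHeader) {
        // Existing heuristic: fewer bytes fetched than the Content-Length header promised.
        if (contentLengthHeader > 0 && bytesFetched < contentLengthHeader) {
            return true;
        }
        // Proposed addition: also honor an explicit truncation flag from the protocol.
        // Boolean.parseBoolean(null) is false, so a missing flag changes nothing.
        return Boolean.parseBoolean(protocolMetadata.get(TRUNCATED_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put(TRUNCATED_KEY, "true");
        // Truncation flagged by the protocol, no Content-Length header available:
        System.out.println(shouldSkipParse(meta, 1024, -1)); // prints "true"
    }
}
```

The point of the second check is that the Content-Length heuristic alone misses truncation when the server never sent a length header, which is exactly the case the flag covers.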





[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783352#comment-17783352
 ] 

Tim Allison commented on NUTCH-3019:


{noformat}
[junit] Tests run: 7, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 
2.271 sec
[junit] Test org.apache.nutch.protocol.httpclient.TestProtocolHttpClient 
FAILED 
{noformat}

???

> Upgrade to Apache Tika 2.9.1
> 
>
> Key: NUTCH-3019
> URL: https://issues.apache.org/jira/browse/NUTCH-3019
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.20
>
>
> There's a commons-compress cve that affects Tika 2.9.0 (CVE-2023-42503). This 
> is fixed by an upgrade in Tika 2.9.1.





[jira] [Resolved] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3019.

Fix Version/s: 1.20
   Resolution: Fixed

> Upgrade to Apache Tika 2.9.1
> 
>
> Key: NUTCH-3019
> URL: https://issues.apache.org/jira/browse/NUTCH-3019
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.20
>
>
> There's a commons-compress cve that affects Tika 2.9.0 (CVE-2023-42503). This 
> is fixed by an upgrade in Tika 2.9.1.





[jira] [Comment Edited] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783254#comment-17783254
 ] 

Tim Allison edited comment on NUTCH-3019 at 11/6/23 3:46 PM:
-

tballison commented on PR #797:
URL: [https://github.com/apache/nutch/pull/797#issuecomment-1795161171]

```

2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, 
Skipped: 4, Time elapsed: 4.342 sec
2023-11-06T15:02:48.2192793Z [junit] Test 
org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED

```


was (Author: githubbot):
tballison commented on PR #797:
URL: https://github.com/apache/nutch/pull/797#issuecomment-1795161171

   ```2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, 
Errors: 0, Skipped: 4, Time elapsed: 4.342 sec
   2023-11-06T15:02:48.2192793Z [junit] Test 
org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED```




> Upgrade to Apache Tika 2.9.1
> 
>
> Key: NUTCH-3019
> URL: https://issues.apache.org/jira/browse/NUTCH-3019
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> There's a commons-compress cve that affects Tika 2.9.0 (CVE-2023-42503). This 
> is fixed by an upgrade in Tika 2.9.1.





[jira] [Comment Edited] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783252#comment-17783252
 ] 

Tim Allison edited comment on NUTCH-3019 at 11/6/23 3:32 PM:
-

I just got this, which tracks with the upgrade to 2.9.0.
{noformat}
 ParserStatus
        failed=84
        success=625{noformat}
https://issues.apache.org/jira/browse/NUTCH-2959?focusedCommentId=17771490=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17771490


was (Author: talli...@mitre.org):
ParserStatus
        failed=84
        success=625

> Upgrade to Apache Tika 2.9.1
> 
>
> Key: NUTCH-3019
> URL: https://issues.apache.org/jira/browse/NUTCH-3019
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> There's a commons-compress cve that affects Tika 2.9.0 (CVE-2023-42503). This 
> is fixed by an upgrade in Tika 2.9.1.





[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783252#comment-17783252
 ] 

Tim Allison commented on NUTCH-3019:


ParserStatus
        failed=84
        success=625

> Upgrade to Apache Tika 2.9.1
> 
>
> Key: NUTCH-3019
> URL: https://issues.apache.org/jira/browse/NUTCH-3019
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> There's a commons-compress cve that affects Tika 2.9.0 (CVE-2023-42503). This 
> is fixed by an upgrade in Tika 2.9.1.





[jira] [Created] (NUTCH-3021) Improve http-protocol to identify truncated content

2023-11-01 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3021:
--

 Summary: Improve http-protocol to identify truncated content
 Key: NUTCH-3021
 URL: https://issues.apache.org/jira/browse/NUTCH-3021
 Project: Nutch
  Issue Type: Task
Reporter: Tim Allison


On the user list, [~snagel] noted that the http-protocol could flag truncated 
files as okhttp does currently. This would allow for more precise handling by 
ParseSegment (see NUTCH-3020).





[jira] [Created] (NUTCH-3020) ParseSegment should check for protocol's flags for truncation

2023-11-01 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3020:
--

 Summary: ParseSegment should check for protocol's flags for 
truncation
 Key: NUTCH-3020
 URL: https://issues.apache.org/jira/browse/NUTCH-3020
 Project: Nutch
  Issue Type: Task
Reporter: Tim Allison


As discussed on the user list, several protocols can identify when a fetch has 
been truncated. ParseSegment only checks for the number of bytes fetched vs the 
http length header (if it exists). We should modify ParseSegment to check for 
notification of truncation from the protocols.

I noticed this specifically with okhttp, but other protocols may flag 
truncation as well.
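A minimal sketch of the combined check, assuming the protocol plugin sets an explicit flag; the metadata keys (`fetch.truncated`, `Content-Length`) and the class name are illustrative, not the actual Nutch field names:

```java
import java.util.Map;

public class TruncationCheck {

  // Returns true if the fetch looks truncated: either the protocol
  // explicitly flagged truncation, or fewer bytes arrived than the
  // Content-Length header declared.
  static boolean isTruncated(byte[] content, Map<String, String> metadata) {
    // Explicit flag from the protocol (e.g. okhttp can mark truncated fetches).
    if ("true".equalsIgnoreCase(metadata.get("fetch.truncated"))) {
      return true;
    }
    // Fallback: compare bytes fetched against the declared Content-Length.
    String lengthHeader = metadata.get("Content-Length");
    if (lengthHeader != null) {
      try {
        long declared = Long.parseLong(lengthHeader.trim());
        return content.length < declared;
      } catch (NumberFormatException e) {
        // Malformed header: cannot tell, assume not truncated.
      }
    }
    return false;
  }
}
```

The point of the flag-first order is that a protocol's own signal is more reliable than the header comparison, which fails when the server omits Content-Length.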





[jira] [Updated] (NUTCH-3018) Consider pooling remote webdrivers for Selenium?

2023-10-31 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3018:
---
Description: 
It looks like it takes between 2x and 4x of the time to initialize the remote 
webdriver in selenium than it does to render/fetch a couple of test pages I'm 
working with.

On linux with a chrome driver, ~1.5 seconds to load the driver and then .5 of a 
second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 of 
a second to fetch/render.  

On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second to 
fetch/render a page.

Is it worth pooling webdrivers or does that add too much complexity/overhead?

  was:
It looks like it takes between 2x and 4x of the time to initialize the remote 
webdriver in selenium than it does to render/fetch a couple of test pages I'm 
working with.

On a mac with a chrome driver, ~1.5 seconds to load the driver and then .5 of a 
second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 of 
a second to fetch/render.  

On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second to 
fetch/render a page.

Is it worth pooling webdrivers or does that add too much complexity/overhead?


> Consider pooling remote webdrivers for Selenium?
> 
>
> Key: NUTCH-3018
> URL: https://issues.apache.org/jira/browse/NUTCH-3018
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> It looks like it takes between 2x and 4x of the time to initialize the remote 
> webdriver in selenium than it does to render/fetch a couple of test pages I'm 
> working with.
> On linux with a chrome driver, ~1.5 seconds to load the driver and then .5 of 
> a second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 
> of a second to fetch/render.  
> On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second 
> to fetch/render a page.
> Is it worth pooling webdrivers or does that add too much complexity/overhead?





[jira] [Comment Edited] (NUTCH-3018) Consider pooling remote webdrivers for Selenium?

2023-10-31 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781485#comment-17781485
 ] 

Tim Allison edited comment on NUTCH-3018 at 10/31/23 6:55 PM:
--

On further reflection, the above means that if each of our threads creates its 
own web driver for every fetch, the selenium instance blocks the creation of 
new web drivers until the current number of connections is less than the number 
of worker nodes TIMES SE_NODE_MAX_SESSIONS.

In short, we're already rate-limited by selenium.  We may as well rate limit 
ourselves and reuse drivers when we can?


was (Author: talli...@mitre.org):
On further reflection, what the above means is that if each of our threads 
creates its own web driver for every fetch, that means that the selenium 
instance is blocking the creation of these web-drivers until the current number 
of connections is < the number of worker nodes X SE_NODE_MAX_SESSIONS.

In short, we're already rate-limited by selenium.  We may as well rate limit 
ourselves and reuse drivers when we can?

> Consider pooling remote webdrivers for Selenium?
> 
>
> Key: NUTCH-3018
> URL: https://issues.apache.org/jira/browse/NUTCH-3018
> Project: Nutch
>  Issue Type: Task
>    Reporter: Tim Allison
>Priority: Minor
>
> It looks like it takes between 2x and 4x of the time to initialize the remote 
> webdriver in selenium than it does to render/fetch a couple of test pages I'm 
> working with.
> On a mac with a chrome driver, ~1.5 seconds to load the driver and then .5 of 
> a second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 
> of a second to fetch/render.  
> On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second 
> to fetch/render a page.
> Is it worth pooling webdrivers or does that add too much complexity/overhead?





[jira] [Commented] (NUTCH-3018) Consider pooling remote webdrivers for Selenium?

2023-10-31 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781485#comment-17781485
 ] 

Tim Allison commented on NUTCH-3018:


On further reflection, what the above means is that if each of our threads 
creates its own web driver for every fetch, that means that the selenium 
instance is blocking the creation of these web-drivers until the current number 
of connections is < the number of worker nodes X SE_NODE_MAX_SESSIONS.

In short, we're already rate-limited by selenium.  We may as well rate limit 
ourselves and reuse drivers when we can?
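A minimal sketch of such a pool, sized to the session limit the grid will actually grant (workers times SE_NODE_MAX_SESSIONS). The generic `Supplier` stands in for `RemoteWebDriver` creation so the idea can be shown without a running grid; all names here are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

public class DriverPool<T> {
  private final BlockingQueue<T> pool;

  // Eagerly create up to the session limit; the grid would block any
  // attempt to create more anyway.
  public DriverPool(int size, Supplier<T> factory) {
    pool = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++) {
      pool.add(factory.get());
    }
  }

  // Blocks until a driver is free: callers rate-limit themselves instead
  // of piling new sessions onto the grid.
  public T borrow() {
    try {
      return pool.take();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("interrupted while waiting for a driver", e);
    }
  }

  public void release(T driver) {
    pool.offer(driver);
  }
}
```

A real version would also need to replace a driver on exception (drop the broken one and have the factory create a fresh session), which is the "reconnect on exception" part above.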

> Consider pooling remote webdrivers for Selenium?
> 
>
> Key: NUTCH-3018
> URL: https://issues.apache.org/jira/browse/NUTCH-3018
> Project: Nutch
>  Issue Type: Task
>    Reporter: Tim Allison
>Priority: Minor
>
> It looks like it takes between 2x and 4x of the time to initialize the remote 
> webdriver in selenium than it does to render/fetch a couple of test pages I'm 
> working with.
> On a mac with a chrome driver, ~1.5 seconds to load the driver and then .5 of 
> a second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 
> of a second to fetch/render.  
> On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second 
> to fetch/render a page.
> Is it worth pooling webdrivers or does that add too much complexity/overhead?





[jira] [Comment Edited] (NUTCH-3018) Consider pooling remote webdrivers for Selenium?

2023-10-31 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781483#comment-17781483
 ] 

Tim Allison edited comment on NUTCH-3018 at 10/31/23 6:46 PM:
--

It looks like we cannot create more web drivers than the number of worker nodes 
X {{SE_NODE_MAX_SESSIONS}}. 

I think it would still be useful to reuse the webdriver(s) if we can. We could 
reconnect on exception, etc...

This may be a horribly misguided approach.  Let me know. :D


was (Author: talli...@mitre.org):
It looks like we cannot create more web drivers than the 
{{SE_NODE_MAX_SESSIONS}} which defaults to 1. 

I think it would still be useful to reuse the webdriver(s) if we can. We could 
reconnect on exception, etc...

This may be a horribly misguided approach.  Let me know. :D

> Consider pooling remote webdrivers for Selenium?
> 
>
> Key: NUTCH-3018
> URL: https://issues.apache.org/jira/browse/NUTCH-3018
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> It looks like it takes between 2x and 4x of the time to initialize the remote 
> webdriver in selenium than it does to render/fetch a couple of test pages I'm 
> working with.
> On a mac with a chrome driver, ~1.5 seconds to load the driver and then .5 of 
> a second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 
> of a second to fetch/render.  
> On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second 
> to fetch/render a page.
> Is it worth pooling webdrivers or does that add too much complexity/overhead?





[jira] [Commented] (NUTCH-3018) Consider pooling remote webdrivers for Selenium?

2023-10-31 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781483#comment-17781483
 ] 

Tim Allison commented on NUTCH-3018:


It looks like we cannot create more web drivers than the 
{{SE_NODE_MAX_SESSIONS}} which defaults to 1. 

I think it would still be useful to reuse the webdriver(s) if we can. We could 
reconnect on exception, etc...

This may be a horribly misguided approach.  Let me know. :D

> Consider pooling remote webdrivers for Selenium?
> 
>
> Key: NUTCH-3018
> URL: https://issues.apache.org/jira/browse/NUTCH-3018
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> It looks like it takes between 2x and 4x of the time to initialize the remote 
> webdriver in selenium than it does to render/fetch a couple of test pages I'm 
> working with.
> On a mac with a chrome driver, ~1.5 seconds to load the driver and then .5 of 
> a second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 
> of a second to fetch/render.  
> On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second 
> to fetch/render a page.
> Is it worth pooling webdrivers or does that add too much complexity/overhead?





[jira] [Updated] (NUTCH-3018) Consider pooling remote webdrivers for Selenium?

2023-10-31 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3018:
---
Description: 
It looks like it takes between 2x and 4x of the time to initialize the remote 
webdriver in selenium than it does to render/fetch a couple of test pages I'm 
working with.

On a mac with a chrome driver, ~1.5 seconds to load the driver and then .5 of a 
second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 of 
a second to fetch/render.  

On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second to 
fetch/render a page.

Is it worth pooling webdrivers or does that add too much complexity/overhead?

  was:
It looks like it takes between 2x and 4x of the time to initialize the remote 
webdriver in selenium than it does to render/fetch a couple of test pages I'm 
working with.

On linux with a chrome driver, ~1.5 seconds to load the driver and then .5 of a 
second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 of 
a second to fetch/render.  I think the delta is greater for firefox.

Is it worth pooling webdrivers or does that add too much complexity/overhead?


> Consider pooling remote webdrivers for Selenium?
> 
>
> Key: NUTCH-3018
> URL: https://issues.apache.org/jira/browse/NUTCH-3018
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> It looks like it takes between 2x and 4x of the time to initialize the remote 
> webdriver in selenium than it does to render/fetch a couple of test pages I'm 
> working with.
> On a mac with a chrome driver, ~1.5 seconds to load the driver and then .5 of 
> a second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 
> of a second to fetch/render.  
> On a mac with firefox driver, ~3.7 seconds to load the driver and ~1 second 
> to fetch/render a page.
> Is it worth pooling webdrivers or does that add too much complexity/overhead?





[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-10-31 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781482#comment-17781482
 ] 

Tim Allison commented on NUTCH-3019:


Separately, I noticed that logging from Tika was not working when running 
locally. It looks like we need to add the log4j2 jars into the shim.

> Upgrade to Apache Tika 2.9.1
> 
>
> Key: NUTCH-3019
> URL: https://issues.apache.org/jira/browse/NUTCH-3019
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> There's a commons-compress cve that affects Tika 2.9.0 (CVE-2023-42503). This 
> is fixed by an upgrade in Tika 2.9.1.





[jira] [Created] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-10-31 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3019:
--

 Summary: Upgrade to Apache Tika 2.9.1
 Key: NUTCH-3019
 URL: https://issues.apache.org/jira/browse/NUTCH-3019
 Project: Nutch
  Issue Type: Task
Reporter: Tim Allison


There's a commons-compress cve that affects Tika 2.9.0 (CVE-2023-42503). This 
is fixed by an upgrade in Tika 2.9.1.





[jira] [Resolved] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-10-31 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-2959.

Resolution: Fixed

> Upgrade to Apache Tika 2.9.0
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>






[jira] [Created] (NUTCH-3018) Consider pooling remote webdrivers for Selenium?

2023-10-31 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3018:
--

 Summary: Consider pooling remote webdrivers for Selenium?
 Key: NUTCH-3018
 URL: https://issues.apache.org/jira/browse/NUTCH-3018
 Project: Nutch
  Issue Type: Task
Reporter: Tim Allison


It looks like it takes between 2x and 4x of the time to initialize the remote 
webdriver in selenium than it does to render/fetch a couple of test pages I'm 
working with.

On linux with a chrome driver, ~1.5 seconds to load the driver and then .5 of a 
second to fetch/render the page. On a mac, ~1.2 seconds to load and then .5 of 
a second to fetch/render.  I think the delta is greater for firefox.

Is it worth pooling webdrivers or does that add too much complexity/overhead?





[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-10-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771476#comment-17771476
 ] 

Tim Allison commented on NUTCH-2959:


If you and the Nutch team are ok with the shim, I'll work towards that. 

The challenge is that Hadoop 3.4.0 as it is currently configured will fail with 
the latest POI, which means that we can get up to Tika 2.9.0 whenever that is 
available, but will then be blocked until Hadoop upgrades commons-io in 3.5.0.

I could open a ticket to upgrade commons-io in Hadoop to POI's version and hope 
that that mod is accepted into 3.4.0, but the pushback on the earlier upgrade 
suggests that the Hadoop team may not be open to the upgrade.



> Upgrade to Apache Tika 2.9.0
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>






[jira] [Comment Edited] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-10-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771170#comment-17771170
 ] 

Tim Allison edited comment on NUTCH-2959 at 10/2/23 3:51 PM:
-

I've continued to stub my toes on this this morning.

The best option, which I acknowledge might not be acceptable, seems to be to 
create a separate (temporary!) shim project that shades commons-io for Tika and 
POI and removes xerces/xml-apis.

The shaded fat tika-app jar didn't work because of xerces/xml-apis.  I could 
have done some ugly jar rewriting in ant to delete org/apache/xerces etc., but 
that felt really awful.

The current shim project is here: https://github.com/tballison/hadoop-safe-tika

If this is something we want to pursue, I can run through the full tests etc 
and then publish to maven central.  I also have to add the language detector.  
The repo is purely proof of concept and shouldn't even be built/tested locally 
yet.

The goal would be to use this until Apache Tika, Apache POI and Apache Hadoop 
can all get to a compatible version of commons-io.

This solution would allow us to avoid the messy shading of commons-io in 
tika-app on the actual Apache Tika project.

WDYT?


was (Author: talli...@mitre.org):
I've continued to stub my toes on this this morning.

The best option, which I realize might not be acceptable, seems to be to create 
a separate (temporary!) shim project that shades commons-io for Tika and POI 
and removes xerces/xml-apis.

The shaded fat tika-app jar didn't work because of xerces/xml-apis.

The current shim project is here: https://github.com/tballison/hadoop-safe-tika

If this is something we want to pursue, I can run through the full tests etc 
and then publish to maven central.  I also have to add the language detector.  
The repo is purely proof of concept and shouldn't even be built/tested locally 
yet.

The goal would be to use this until Apache Tika, Apache POI and Apache Hadoop 
can all get to a compatible version of commons-io.

This solution would allow us to avoid the messy shading of commons-io in 
tika-app on the actual Apache Tika project.

WDYT?

> Upgrade to Apache Tika 2.9.0
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>






[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-10-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771170#comment-17771170
 ] 

Tim Allison commented on NUTCH-2959:


I've continued to stub my toes on this this morning.

The best option, which I realize might not be acceptable, seems to be to create 
a separate (temporary!) shim project that shades commons-io for Tika and POI 
and removes xerces/xml-apis.

The shaded fat tika-app jar didn't work because of xerces/xml-apis.

The current shim project is here: https://github.com/tballison/hadoop-safe-tika

If this is something we want to pursue, I can run through the full tests etc 
and then publish to maven central.  I also have to add the language detector.  
The repo is purely proof of concept and shouldn't even be built/tested locally 
yet.

The goal would be to use this until Apache Tika, Apache POI and Apache Hadoop 
can all get to a compatible version of commons-io.

This solution would allow us to avoid the messy shading of commons-io in 
tika-app on the actual Apache Tika project.

WDYT?
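For illustration, the relocation in such a shim could look roughly like the following maven-shade-plugin fragment; the coordinates, relocated package name, and filter patterns are assumptions, not the actual shim configuration:

```xml
<!-- Hypothetical sketch: relocate commons-io inside the shim artifact so
     Tika/POI see their own copy while Hadoop's older commons-io stays on
     the classpath, and drop the conflicting xerces classes. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.apache.commons.io</pattern>
        <shadedPattern>shaded.org.apache.commons.io</shadedPattern>
      </relocation>
    </relocations>
    <filters>
      <filter>
        <artifact>xerces:xercesImpl</artifact>
        <excludes>
          <exclude>org/apache/xerces/**</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
</plugin>
```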

> Upgrade to Apache Tika 2.9.0
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>






Re: Establishing a Nutch development roadmap

2023-09-28 Thread Tim Allison
Sorry for two emails...

Migrating javax->jakarta has been quite a chore on Tika because of
dependencies. Given back-compat issues with hadoop, is this even on the
horizon for Nutch?

On Thu, Sep 28, 2023 at 9:29 AM Tim Allison  wrote:

> Y, I'd like to get a working Tika version in a release fairly soon. Not
> sure how much effort a release is?
>
>
> On Thu, Sep 28, 2023 at 8:29 AM Sebastian Nagel  wrote:
>
>> Hi Lewis,
>>
>> thanks!
>>
>> I'd put on top of the list
>>
>> * release 1.20
>>
>> Since the release of 1.19 more than one year has elapsed.
>>
>> Otherwise I agree with all points on the road map, even
>> in this order / priority.
>>
>> Best,
>> Sebastian
>>
>>
>> On 9/26/23 18:37, lewis john mcgibbney wrote:
>> > Hi dev@,
>> >
>> > I've been at arms length for a while as $dayjob changed and then
>> > changed again over the last number of years.
>> >
>> > With that being said, I wanted to start a thread on $title with the
>> > goal of establishing some "big items" we could put on the roadmap and
>> > maybe even publish...
>> >
>> > Here are some of the thing's I've been thinking about (unordered)
>> >
>> > * NUTCH-2940 Develop Gradle Core Build for Apache Nutch
>> > * Metrics system integration cf.
>> https://github.com/apache/nutch/pull/712
>> > * Upgrading Javac version > 11
>> > * Trade study to consider integrating (something like) Plugin
>> > Framework for Java (PF4J) into Nutch
>> > * porting Nutch to run on Apache Beam https://beam.apache.org/
>> >
>> > Does anyone else have candidates they wish to add?
>> >
>> > Thanks for your consideration.
>> >
>> > lewismc
>> >
>> >
>>
>


Re: Establishing a Nutch development roadmap

2023-09-28 Thread Tim Allison
Y, I'd like to get a working Tika version in a release fairly soon. Not
sure how much effort a release is?


On Thu, Sep 28, 2023 at 8:29 AM Sebastian Nagel  wrote:

> Hi Lewis,
>
> thanks!
>
> I'd put on top of the list
>
> * release 1.20
>
> Since the release of 1.19 more than one year has elapsed.
>
> Otherwise I agree with all points on the road map, even
> in this order / priority.
>
> Best,
> Sebastian
>
>
> On 9/26/23 18:37, lewis john mcgibbney wrote:
> > Hi dev@,
> >
> > I've been at arms length for a while as $dayjob changed and then
> > changed again over the last number of years.
> >
> > With that being said, I wanted to start a thread on $title with the
> > goal of establishing some "big items" we could put on the roadmap and
> > maybe even publish...
> >
> > Here are some of the thing's I've been thinking about (unordered)
> >
> > * NUTCH-2940 Develop Gradle Core Build for Apache Nutch
> > * Metrics system integration cf.
> https://github.com/apache/nutch/pull/712
> > * Upgrading Javac version > 11
> > * Trade study to consider integrating (something like) Plugin
> > Framework for Java (PF4J) into Nutch
> > * porting Nutch to run on Apache Beam https://beam.apache.org/
> >
> > Does anyone else have candidates they wish to add?
> >
> > Thanks for your consideration.
> >
> > lewismc
> >
> >
>


[jira] [Commented] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-09-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770059#comment-17770059
 ] 

Tim Allison commented on NUTCH-3006:


An alternative approach would be for Tika to revert 
CloseShieldInputStream.wrap(), which I think was the only conflict?!  Should I 
check with the Tika community about that?

The notion of downgrading Tika to a December 2021 release unsettles me, and I 
have no idea how far out Hadoop 3.4.0 is.

WDYT?

> Downgrade Tika dependency to 2.2.1 (core and parse-tika)
> 
>
> Key: NUTCH-3006
> URL: https://issues.apache.org/jira/browse/NUTCH-3006
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Tika 2.3.0 and upwards depend on commons-io 2.11.0 (or even higher), which
> is not available when Nutch is used on Hadoop. Only Hadoop 3.4.0 is expected 
> to ship with commons-io 2.11.0 (HADOOP-18301), all currently released 
> versions provide commons-io 2.8.0. Because Hadoop-required dependencies are 
> enforced in (pseudo)distributed mode, using Tika may cause issues, see 
> NUTCH-2937 and NUTCH-2959.
> [~lewismc] suggested in the discussion of [Github PR 
> #776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to 
> resolve these issues for now and until Hadoop 3.4.0 becomes available.





[jira] [Created] (NUTCH-3005) Upgrade selenium as needed

2023-09-26 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3005:
--

 Summary: Upgrade selenium as needed
 Key: NUTCH-3005
 URL: https://issues.apache.org/jira/browse/NUTCH-3005
 Project: Nutch
  Issue Type: Improvement
Reporter: Tim Allison


When we choose to upgrade selenium, we should take note of this blog about 
changes in headless chromium: 
https://www.selenium.dev/blog/2023/headless-is-going-away/

ChromeOptions options = new ChromeOptions();
options.addArguments("--headless=new");
WebDriver driver = new ChromeDriver(options);
driver.get("https://selenium.dev");
driver.quit();





[jira] [Resolved] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3004.

Resolution: Fixed

> Avoid NPE in HttpResponse
> -
>
> Key: NUTCH-3004
> URL: https://issues.apache.org/jira/browse/NUTCH-3004
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.19
>    Reporter: Tim Allison
>Priority: Trivial
> Fix For: 1.20
>
>
> I recently deployed nutch on a FIPS enabled rhel 8 instance, and I got an NPE 
> in HttpResponse.  When I set the log level to debug, I could see what was 
> happening, but it would have been better to get a meaningful exception rather 
> than an NPE.
> The issue is that in the catch clause, the exception is propagated only if 
> the message is "handshake alert..." and then the reconnect fails.  If the 
> message is not that, then the ssl socket remains null, and we get an NPE 
> below the source I quote here.
> I think we should throw the same HTTPException that we do throw in the nested 
> try if the message is not "handshake alert..."
> {code:java}
> try {
>   sslsocket = getSSLSocket(socket, sockHost, sockPort);
>   sslsocket.startHandshake();
> } catch (Exception e) {
>   Http.LOG.debug("SSL connection to {} failed with: {}", url,
>   e.getMessage());
>   if ("handshake alert:  unrecognized_name".equals(e.getMessage())) {
> try {
>   // Reconnect, see NUTCH-2447
>   socket = new Socket();
>   socket.setSoTimeout(http.getTimeout());
>   socket.connect(sockAddr, http.getTimeout());
>   sslsocket = getSSLSocket(socket, "", sockPort);
>   sslsocket.startHandshake();
> } catch (Exception ex) {
>   String msg = "SSL reconnect to " + url + " failed with: "
>   + e.getMessage();
>   throw new HttpException(msg);
> }
>   }
> }
> socket = sslsocket;
>   }
> {code}





[jira] [Created] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-25 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3004:
--

 Summary: Avoid NPE in HttpResponse
 Key: NUTCH-3004
 URL: https://issues.apache.org/jira/browse/NUTCH-3004
 Project: Nutch
  Issue Type: Improvement
Reporter: Tim Allison


I recently deployed Nutch on a FIPS-enabled RHEL 8 instance, and I got an NPE 
in HttpResponse.  When I set the log level to debug, I could see what was 
happening, but it would have been better to get a meaningful exception rather 
than an NPE.

The issue is that in the catch clause, the exception is propagated only if the 
message is "handshake alert..." and the subsequent reconnect also fails.  If the 
message is anything else, the SSL socket remains null, and we get an NPE just 
below the code I quote here.

I think we should throw the same HttpException that we throw in the nested 
try when the message is not "handshake alert..."


{code:java}
try {
  sslsocket = getSSLSocket(socket, sockHost, sockPort);
  sslsocket.startHandshake();
} catch (Exception e) {
  Http.LOG.debug("SSL connection to {} failed with: {}", url,
  e.getMessage());
  if ("handshake alert:  unrecognized_name".equals(e.getMessage())) {
try {
  // Reconnect, see NUTCH-2447
  socket = new Socket();
  socket.setSoTimeout(http.getTimeout());
  socket.connect(sockAddr, http.getTimeout());
  sslsocket = getSSLSocket(socket, "", sockPort);
  sslsocket.startHandshake();
} catch (Exception ex) {
  String msg = "SSL reconnect to " + url + " failed with: "
  + e.getMessage();
  throw new HttpException(msg);
}
  }
}
socket = sslsocket;
  }

{code}
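A minimal, self-contained sketch of the proposed restructuring (hypothetical names, not Nutch's actual source): pull the "should we attempt the NUTCH-2447 reconnect?" decision out of the catch clause, so that any handshake failure other than the "handshake alert" case surfaces as a meaningful HttpException instead of leaving the SSL socket null:

```java
// Sketch of the proposed fix (assumption: illustrative only, not Nutch's
// committed code). A failure whose message is not the "handshake alert"
// marker is rethrown as HttpException immediately, so the caller can never
// proceed with a null sslsocket and hit an NPE later.
public class HandshakeFailureSketch {
    static class HttpException extends Exception {
        HttpException(String msg) { super(msg); }
    }

    // The JDK's SNI failure message (note the double space).
    static final String UNRECOGNIZED_NAME = "handshake alert:  unrecognized_name";

    // Returns true when the NUTCH-2447 reconnect should be attempted;
    // any other handshake failure becomes an HttpException instead of an NPE.
    static boolean shouldReconnect(Exception e, String url) throws HttpException {
        if (UNRECOGNIZED_NAME.equals(e.getMessage())) {
            return true;
        }
        throw new HttpException(
            "SSL connection to " + url + " failed with: " + e.getMessage());
    }

    public static void main(String[] args) throws Exception {
        // unrecognized_name -> take the reconnect path
        System.out.println(
            shouldReconnect(new Exception(UNRECOGNIZED_NAME), "https://example.com"));
        // any other handshake failure -> meaningful exception, no NPE later
        try {
            shouldReconnect(new Exception("certificate_unknown"), "https://example.com");
        } catch (HttpException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```

The reconnect itself (rebuilding the socket and calling startHandshake() again) would stay inside the existing nested try, unchanged.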






[jira] [Commented] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2023-09-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766832#comment-17766832
 ] 

Tim Allison commented on NUTCH-2937:


As [~snagel] pointed out on the PR for NUTCH-2959 -- looks like we have to wait 
for Hadoop 3.4.0: https://issues.apache.org/jira/browse/HADOOP-18301 :(

Unless we revert the .wrap() in Tika in, say, 2.9.1?  Yuck...

> parse-tika: review dependency exclusions and avoid dependency conflicts in 
> distributed mode
> ---
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a 
> conflicting dependency to commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for 
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] 
> org.apache.nutch.parse.ParseUtil: Error parsing 
> http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.commons.io.input.CloseShieldInputStream 
> org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
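A hedged diagnostic sketch for conflicts like the one in the stack trace above (not part of Nutch): print which jar the JVM actually loaded a class from. `CloseShieldInputStream.wrap(...)` exists only in commons-io >= 2.9.0, so seeing Hadoop's commons-io 2.5 jar in the output would confirm the conflict. A JDK class is used here so the snippet stays self-contained; in a real crawl job you would pass `org.apache.commons.io.input.CloseShieldInputStream`:

```java
// Hedged diagnostic sketch (illustrative, not Nutch code): given a
// NoSuchMethodError, report which jar supplied the offending class.
public class WhichJar {
    static String locationOf(Class<?> c) {
        var src = c.getProtectionDomain().getCodeSource();
        // Bootstrap/JDK classes report no code source.
        return src == null ? "bootstrap/JDK" : src.getLocation().toString();
    }

    public static void main(String[] args) {
        // In a Hadoop job, substitute:
        //   Class.forName("org.apache.commons.io.input.CloseShieldInputStream")
        System.out.println(locationOf(String.class)); // bootstrap/JDK
    }
}
```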





[jira] [Created] (NUTCH-3003) Consider integration testing in a Dockerized mini-hadoop cluster via testcontainers?

2023-09-19 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3003:
--

 Summary: Consider integration testing in a Dockerized mini-hadoop 
cluster via testcontainers?
 Key: NUTCH-3003
 URL: https://issues.apache.org/jira/browse/NUTCH-3003
 Project: Nutch
  Issue Type: Wish
Reporter: Tim Allison


I don't think I'll have the time to do this any time soon, but this might help 
lead the way:

https://github.com/ooraini/testcontainers-hdfs

We've started using testcontainers in unit tests on the Tika project for 
OpenSearch, Kafka and Solr, and it has been invaluable.

I have no idea how much effort it would take to get this running in Nutch with 
ant etc...





[jira] [Resolved] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-17 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-2978.

Fix Version/s: 1.20
   Resolution: Fixed

Many thanks [~markus17] for all of the work on this!  I really didn't do much 
beyond your initial patches!

Many thanks to [~snagel] for confirming that it works on an actual hadoop 
cluster!

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or, with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl binding, and getting 
> rid of the old log4j -> reload4j.





[jira] [Updated] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-14 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-2959:
---
Summary: Upgrade to Apache Tika 2.9.0  (was: Upgrade to Apache Tika 2.4.1)

> Upgrade to Apache Tika 2.9.0
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>






[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-09-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765306#comment-17765306
 ] 

Tim Allison commented on NUTCH-2959:


Currently working on this to bump to Tika 2.9.0.  PR incoming once I get a 
clean build.

> Upgrade to Apache Tika 2.4.1
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>






[jira] [Resolved] (NUTCH-2998) Remove the Any23 plugin

2023-09-14 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-2998.

Fix Version/s: 1.20
   Resolution: Fixed

> Remove the Any23 plugin
> ---
>
> Key: NUTCH-2998
> URL: https://issues.apache.org/jira/browse/NUTCH-2998
> Project: Nutch
>  Issue Type: Task
>  Components: any23
>    Reporter: Tim Allison
>Priority: Major
> Fix For: 1.20
>
>
> I'm not sure how we want to handle this.  Any23 moved to the Attic in June 
> 2023.  We should probably remove it from Nutch?  I'm not sure how abruptly we 
> want to do that.
> We could deprecate it for 1.20 and then remove it in 1.21 or later?  Or we 
> could choose to remove it for 1.20.
> What do you think?





[jira] [Resolved] (NUTCH-3000) protocol-selenium returns only the body, strips off the <head> element

2023-09-13 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3000.

Fix Version/s: 1.20
   Resolution: Fixed

> protocol-selenium returns only the body, strips off the <head> element
> --
>
> Key: NUTCH-3000
> URL: https://issues.apache.org/jira/browse/NUTCH-3000
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>    Reporter: Tim Allison
>Priority: Major
> Fix For: 1.20
>
>
> The selenium protocol returns only the body portion of the HTML, which means 
> that neither the title nor the other page metadata in the <head> section 
> gets extracted.
> {noformat}
> String innerHtml = driver.findElement(By.tagName("body"))
> .getAttribute("innerHTML");
> {noformat}
> We should return the full HTML, no?





[jira] [Resolved] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3001.

Fix Version/s: 1.20
   Resolution: Fixed

> protocol-selenium requires Content-Type header 
> ---
>
> Key: NUTCH-3001
> URL: https://issues.apache.org/jira/browse/NUTCH-3001
> Project: Nutch
>  Issue Type: Bug
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.20
>
>
> It looks like the selenium protocol requires that there be a content-type 
> header. 
> The logic seems to be: If the content type is html or xhtml, use selenium, 
> otherwise just grab the bytes.  
> However, with the current logic, if the content-type is null, nothing is 
> pulled.  
> My guess is that the logic should be : if the content type is not null and 
> equals html or xhtml use selenium, otherwise grab the bytes.
> Right?
> {noformat}
>   String contentType = getHeader(Response.CONTENT_TYPE);
>   // handle with Selenium only if content type in HTML or XHTML
>   if (contentType != null) {
>  if (contentType.contains("text/html")
> || contentType.contains("application/xhtml")) {
>readPlainContent(url);
>  } else {
> ...
> {noformat}
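The null-safe dispatch suggested above can be sketched as follows (assumption: this is the proposed logic, not Nutch's committed code). Selenium is used only when a Content-Type header is present AND advertises (X)HTML; a missing header falls back to a plain byte fetch instead of fetching nothing:

```java
// Sketch of the suggested null-safe Content-Type dispatch (illustrative).
public class ContentTypeDispatch {
    // True -> render with Selenium; false -> just grab the raw bytes.
    static boolean useSelenium(String contentType) {
        return contentType != null
            && (contentType.contains("text/html")
                || contentType.contains("application/xhtml"));
    }

    public static void main(String[] args) {
        System.out.println(useSelenium("text/html; charset=utf-8")); // true
        System.out.println(useSelenium("application/pdf"));          // false
        System.out.println(useSelenium(null)); // false: plain fetch, not "nothing"
    }
}
```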





[jira] [Commented] (NUTCH-2998) Remove the Any23 plugin

2023-09-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764741#comment-17764741
 ] 

Tim Allison commented on NUTCH-2998:


Sorry, I botched the title in the PR: https://github.com/apache/nutch/pull/775

> Remove the Any23 plugin
> ---
>
> Key: NUTCH-2998
> URL: https://issues.apache.org/jira/browse/NUTCH-2998
> Project: Nutch
>  Issue Type: Task
>  Components: any23
>    Reporter: Tim Allison
>Priority: Major
>
> I'm not sure how we want to handle this.  Any23 moved to the Attic in June 
> 2023.  We should probably remove it from Nutch?  I'm not sure how abruptly we 
> want to do that.
> We could deprecate it for 1.20 and then remove it in 1.21 or later?  Or we 
> could choose to remove it for 1.20.
> What do you think?





[DISCUSS] Removing Any23 from Nutch?

2023-09-13 Thread Tim Allison
All,
  I opened https://issues.apache.org/jira/browse/NUTCH-2998 a few weeks
ago.  Any23 was moved to the attic in June. Unless there are objections, I
propose removing it from Nutch before the next release.
  Any objections?

   Best,

   Tim


[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764705#comment-17764705
 ] 

Tim Allison commented on NUTCH-2978:


I haven't tested it in Hadoop. I've just run it locally and, for the modules I'm 
using, it seems to work.

Please, please, please help test it more broadly!

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or, with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl binding, and getting 
> rid of the old log4j -> reload4j.





[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3001:
---
Description: 
It looks like the selenium protocol requires that there be a content-type 
header. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

However, with the current logic, if the content-type is null, nothing is 
pulled.  

My guess is that the logic should be : if the content type is not null and 
equals html or xhtml use selenium, otherwise grab the bytes.

Right?

{noformat}
  String contentType = getHeader(Response.CONTENT_TYPE);

  // handle with Selenium only if content type in HTML or XHTML
  if (contentType != null) {
 if (contentType.contains("text/html")
|| contentType.contains("application/xhtml")) {
   readPlainContent(url);
 } else {
...
{noformat}

  was:
It looks like the selenium protocol requires that there be a content-type 
header. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

However, with the current logic, if the content-type is null, nothing is 
pulled.  

My guess is that the logic should be : if the content type is not null and 
equals html or xhtml use selenium, otherwise grab the bytes.

Right?

{noformat}
  String contentType = getHeader(Response.CONTENT_TYPE);

  // handle with Selenium only if content type in HTML or XHTML
  if (contentType != null) {
{noformat}


> protocol-selenium requires Content-Type header 
> ---
>
> Key: NUTCH-3001
> URL: https://issues.apache.org/jira/browse/NUTCH-3001
> Project: Nutch
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
>
> It looks like the selenium protocol requires that there be a content-type 
> header. 
> The logic seems to be: If the content type is html or xhtml, use selenium, 
> otherwise just grab the bytes.  
> However, with the current logic, if the content-type is null, nothing is 
> pulled.  
> My guess is that the logic should be : if the content type is not null and 
> equals html or xhtml use selenium, otherwise grab the bytes.
> Right?
> {noformat}
>   String contentType = getHeader(Response.CONTENT_TYPE);
>   // handle with Selenium only if content type in HTML or XHTML
>   if (contentType != null) {
>  if (contentType.contains("text/html")
> || contentType.contains("application/xhtml")) {
>readPlainContent(url);
>  } else {
> ...
> {noformat}





[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3001:
---
Priority: Minor  (was: Major)

> protocol-selenium requires Content-Type header 
> ---
>
> Key: NUTCH-3001
> URL: https://issues.apache.org/jira/browse/NUTCH-3001
> Project: Nutch
>  Issue Type: Bug
>        Reporter: Tim Allison
>Priority: Minor
>
> It looks like the selenium protocol requires that there be a content-type 
> header. 
> The logic seems to be: If the content type is html or xhtml, use selenium, 
> otherwise just grab the bytes.  
> However, with the current logic, if the content-type is null, nothing is 
> pulled.  
> My guess is that the logic should be : if the content type is not null and 
> equals html or xhtml use selenium, otherwise grab the bytes.
> Right?
> {noformat}
>   String contentType = getHeader(Response.CONTENT_TYPE);
>   // handle with Selenium only if content type in HTML or XHTML
>   if (contentType != null) {
>  if (contentType.contains("text/html")
> || contentType.contains("application/xhtml")) {
>readPlainContent(url);
>  } else {
> ...
> {noformat}





[jira] [Commented] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764698#comment-17764698
 ] 

Tim Allison commented on NUTCH-3001:


Or is the notion that if the selenium protocol doesn't pull any bytes, a 
backoff http protocol is applied outside of the selenium protocol?

> protocol-selenium requires Content-Type header 
> ---
>
> Key: NUTCH-3001
> URL: https://issues.apache.org/jira/browse/NUTCH-3001
> Project: Nutch
>  Issue Type: Bug
>        Reporter: Tim Allison
>Priority: Minor
>
> It looks like the selenium protocol requires that there be a content-type 
> header. 
> The logic seems to be: If the content type is html or xhtml, use selenium, 
> otherwise just grab the bytes.  
> However, with the current logic, if the content-type is null, nothing is 
> pulled.  
> My guess is that the logic should be : if the content type is not null and 
> equals html or xhtml use selenium, otherwise grab the bytes.
> Right?
> {noformat}
>   String contentType = getHeader(Response.CONTENT_TYPE);
>   // handle with Selenium only if content type in HTML or XHTML
>   if (contentType != null) {
>  if (contentType.contains("text/html")
> || contentType.contains("application/xhtml")) {
>readPlainContent(url);
>  } else {
> ...
> {noformat}





[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3001:
---
Description: 
It looks like the selenium protocol requires that there be a content-type 
header. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

However, with the current logic, if the content-type is null, nothing is 
pulled.  

My guess is that the logic should be : if the content type is not null and 
equals html or xhtml use selenium, otherwise grab the bytes.

Right?

{noformat}
  String contentType = getHeader(Response.CONTENT_TYPE);

  // handle with Selenium only if content type in HTML or XHTML
  if (contentType != null) {
{noformat}

  was:
It looks like the selenium protocol requires that there be content-type. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

If the content-type is null, nothing is pulled.  

My guess is that the logic should be : if the content type is not null and 
equals html or xhtml use selenium, otherwise grab the bytes.

Right?

{noformat}
  String contentType = getHeader(Response.CONTENT_TYPE);

  // handle with Selenium only if content type in HTML or XHTML
  if (contentType != null) {
{noformat}


> protocol-selenium requires Content-Type header 
> ---
>
> Key: NUTCH-3001
> URL: https://issues.apache.org/jira/browse/NUTCH-3001
> Project: Nutch
>  Issue Type: Bug
>        Reporter: Tim Allison
>Priority: Major
>
> It looks like the selenium protocol requires that there be a content-type 
> header. 
> The logic seems to be: If the content type is html or xhtml, use selenium, 
> otherwise just grab the bytes.  
> However, with the current logic, if the content-type is null, nothing is 
> pulled.  
> My guess is that the logic should be : if the content type is not null and 
> equals html or xhtml use selenium, otherwise grab the bytes.
> Right?
> {noformat}
>   String contentType = getHeader(Response.CONTENT_TYPE);
>   // handle with Selenium only if content type in HTML or XHTML
>   if (contentType != null) {
> {noformat}





[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3001:
---
Description: 
It looks like the selenium protocol requires that there be content-type. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

If the content-type is null, nothing is pulled.  

My guess is that the logic should be : if the content type is not null and 
equals html or xhtml use selenium, otherwise grab the bytes.

Right?

{noformat}
  String contentType = getHeader(Response.CONTENT_TYPE);

  // handle with Selenium only if content type in HTML or XHTML
  if (contentType != null) {
{noformat}

  was:
It looks like the selenium protocol requires that there be content-type. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

If the content-type is null, nothing is pulled.  

My guess is that the logic should be : if the content type is not null and 
equals html or xhtml use selenium, otherwise grab the bytes.

Right?

{noformat}
  String contentType = getHeader(Response.CONTENT_TYPE);

  // handle with Selenium only if content type in HTML or XHTML
  if (contentType != null) {


> protocol-selenium requires Content-Type header 
> ---
>
> Key: NUTCH-3001
> URL: https://issues.apache.org/jira/browse/NUTCH-3001
> Project: Nutch
>  Issue Type: Bug
>        Reporter: Tim Allison
>Priority: Major
>
> It looks like the selenium protocol requires that there be content-type. 
> The logic seems to be: If the content type is html or xhtml, use selenium, 
> otherwise just grab the bytes.  
> If the content-type is null, nothing is pulled.  
> My guess is that the logic should be : if the content type is not null and 
> equals html or xhtml use selenium, otherwise grab the bytes.
> Right?
> {noformat}
>   String contentType = getHeader(Response.CONTENT_TYPE);
>   // handle with Selenium only if content type in HTML or XHTML
>   if (contentType != null) {
> {noformat}





[jira] [Created] (NUTCH-3001) protocol-selenium requires Content-Type header

2023-09-13 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3001:
--

 Summary: protocol-selenium requires Content-Type header 
 Key: NUTCH-3001
 URL: https://issues.apache.org/jira/browse/NUTCH-3001
 Project: Nutch
  Issue Type: Bug
Reporter: Tim Allison


It looks like the selenium protocol requires that there be content-type. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

If the content-type is null, nothing is pulled.  

My guess is that the logic should be : if the content type is not null and 
equals html or xhtml use selenium, otherwise grab the bytes.

Right?

{noformat}
  String contentType = getHeader(Response.CONTENT_TYPE);

  // handle with Selenium only if content type in HTML or XHTML
  if (contentType != null) {
{noformat}





[jira] [Created] (NUTCH-3000) protocol-selenium returns only the body, strips off the <head> element

2023-09-13 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3000:
--

 Summary: protocol-selenium returns only the body, strips off the 
 <head> element
 Key: NUTCH-3000
 URL: https://issues.apache.org/jira/browse/NUTCH-3000
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Reporter: Tim Allison


The selenium protocol returns only the body portion of the HTML, which means 
that neither the title nor the other page metadata in the <head> section gets 
extracted.

{noformat}
String innerHtml = driver.findElement(By.tagName("body"))
.getAttribute("innerHTML");
{noformat}

We should return the full HTML, no?





[jira] [Commented] (NUTCH-2998) Remove the Any23 plugin

2023-09-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764376#comment-17764376
 ] 

Tim Allison commented on NUTCH-2998:


I don't want to make such a drastic change without at least a half-hearted +1 
and no -1. :D

Anyone have objections if we remove Any23 before the 1.20 release?

Are there other lists/communications channels I should pursue with this?

> Remove the Any23 plugin
> ---
>
> Key: NUTCH-2998
> URL: https://issues.apache.org/jira/browse/NUTCH-2998
> Project: Nutch
>  Issue Type: Task
>  Components: any23
>    Reporter: Tim Allison
>Priority: Major
>
> I'm not sure how we want to handle this.  Any23 moved to the Attic in June 
> 2023.  We should probably remove it from Nutch?  I'm not sure how abruptly we 
> want to do that.
> We could deprecate it for 1.20 and then remove it in 1.21 or later?  Or we 
> could choose to remove it for 1.20.
> What do you think?





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-08-31 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760926#comment-17760926
 ] 

Tim Allison commented on NUTCH-2978:


K, I think https://github.com/apache/nutch/pull/772 is better.  This is nearly 
entirely based on [~markus17]'s patches.  Let me know what you think.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or, with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl binding, and getting 
> rid of the old log4j -> reload4j.





[jira] [Resolved] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-2999.

Resolution: Fixed

Updated PR should have fixed that issue.  It would be nice to add 
testcontainers-based Elasticsearch and OpenSearch containers for unit tests.  One day...

> Update Lucene version to latest 8.x
> ---
>
> Key: NUTCH-2999
> URL: https://issues.apache.org/jira/browse/NUTCH-2999
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.20
>
>
> It may be the way that I'm loading the project, but, for me, Intellij really 
> does not like the Lucene version conflict between {{scoring-similarity}} and 
> the OpenSearch/Elasticsearch modules.
> Can we bump Lucene to the latest 8.11.2 throughout?
> PR for review incoming.





[jira] [Reopened] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened NUTCH-2999:


The applied PR breaks the lucene-based indexers.

> Update Lucene version to latest 8.x
> ---
>
> Key: NUTCH-2999
> URL: https://issues.apache.org/jira/browse/NUTCH-2999
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.20
>
>
> It may be the way that I'm loading the project, but, for me, Intellij really 
> does not like the Lucene version conflict between {{scoring-similarity}} and 
> the OpenSearch/Elasticsearch modules.
> Can we bump Lucene to the latest 8.11.2 throughout?
> PR for review incoming.





[jira] [Resolved] (NUTCH-2961) Upgrade dependencies of parsefilter-naivebayes

2023-08-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-2961.

Resolution: Fixed

I confirmed we can simply remove those dependencies.  I fixed this as part of 
NUTCH-2999

> Upgrade dependencies of parsefilter-naivebayes
> --
>
> Key: NUTCH-2961
> URL: https://issues.apache.org/jira/browse/NUTCH-2961
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The dependencies (Mahout 0.9, Lucene 5.5.0) of parsefilter-naivebayes date 
> back to 2016/2017 and may need an upgrade.





[jira] [Resolved] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-2999.

Fix Version/s: 1.20
   Resolution: Fixed

Thank you [~markus17] for the review!

> Update Lucene version to latest 8.x
> ---
>
> Key: NUTCH-2999
> URL: https://issues.apache.org/jira/browse/NUTCH-2999
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.20
>
>
> It may be the way that I'm loading the project, but, for me, Intellij really 
> does not like the Lucene version conflict between {{scoring-similarity}} and 
> the OpenSearch/Elasticsearch modules.
> Can we bump Lucene to the latest 8.11.2 throughout?
> PR for review incoming.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760512#comment-17760512
 ] 

Tim Allison commented on NUTCH-2999:


This PR also takes care of NUTCH-2961

> Update Lucene version to latest 8.x
> ---
>
> Key: NUTCH-2999
> URL: https://issues.apache.org/jira/browse/NUTCH-2999
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> It may be the way that I'm loading the project, but, for me, Intellij really 
> does not like the Lucene version conflict between {{scoring-similarity}} and 
> the OpenSearch/Elasticsearch modules.
> Can we bump Lucene to the latest 8.11.2 throughout?
> PR for review incoming.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760511#comment-17760511
 ] 

Tim Allison commented on NUTCH-2999:


https://github.com/apache/nutch/pull/770

> Update Lucene version to latest 8.x
> ---
>
> Key: NUTCH-2999
> URL: https://issues.apache.org/jira/browse/NUTCH-2999
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> It may be the way that I'm loading the project, but, for me, Intellij really 
> does not like the Lucene version conflict between {{scoring-similarity}} and 
> the OpenSearch/Elasticsearch modules.
> Can we bump Lucene to the latest 8.11.2 throughout?
> PR for review incoming.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Tim Allison (Jira)
Tim Allison created NUTCH-2999:
--

 Summary: Update Lucene version to latest 8.x
 Key: NUTCH-2999
 URL: https://issues.apache.org/jira/browse/NUTCH-2999
 Project: Nutch
  Issue Type: Task
Reporter: Tim Allison


It may be the way that I'm loading the project, but, for me, Intellij really 
does not like the Lucene version conflict between {{scoring-similarity}} and 
the OpenSearch/Elasticsearch modules.

Can we bump Lucene to the latest 8.11.2 throughout?

PR for review incoming.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2961) Upgrade dependencies of parsefilter-naivebayes

2023-08-30 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760508#comment-17760508
 ] 

Tim Allison commented on NUTCH-2961:


It looks like neither Mahout nor Lucene is actually used any more.  I may be 
misreading the code...

Can we just get rid of them?

> Upgrade dependencies of parsefilter-naivebayes
> --
>
> Key: NUTCH-2961
> URL: https://issues.apache.org/jira/browse/NUTCH-2961
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The dependencies (Mahout 0.9, Lucene 5.5.0) of parsefilter-naivebayes date 
> back to 2016/2017 and may need an upgrade.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2998) Remove the Any23 plugin

2023-08-28 Thread Tim Allison (Jira)
Tim Allison created NUTCH-2998:
--

 Summary: Remove the Any23 plugin
 Key: NUTCH-2998
 URL: https://issues.apache.org/jira/browse/NUTCH-2998
 Project: Nutch
  Issue Type: Task
  Components: any23
Reporter: Tim Allison


I'm not sure how we want to handle this.  Any23 moved to the Attic in June 
2023.  We should probably remove it from Nutch?  I'm not sure how abruptly we 
want to do that.

We could deprecate it for 1.20 and then remove it in 1.21 or later?  Or we 
could choose to remove it for 1.20.

What do you think?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2989) Can't have username/pw AND https on elastic-indexer?!

2023-08-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-2989.

Resolution: Fixed

Fellow Nutch devs, please let me know if I botched any of our processes in 
fixing this.  Thank you!

> Can't have username/pw AND https on elastic-indexer?!
> -
>
> Key: NUTCH-2989
> URL: https://issues.apache.org/jira/browse/NUTCH-2989
> Project: Nutch
>  Issue Type: Task
>  Components: indexer, plugin
>Affects Versions: 1.19
>    Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.20
>
>
> While working on NUTCH-2920, I copied+pasted the elastic indexer.  As part of 
> that process, I noticed that basic auth doesn't work with https.
> {code:java}
> if (auth) {
> restClientBuilder
> .setHttpClientConfigCallback(new HttpClientConfigCallback() {
>   @Override
>   public HttpAsyncClientBuilder customizeHttpClient(
>   HttpAsyncClientBuilder arg0) {
> return 
> arg0.setDefaultCredentialsProvider(credentialsProvider);
>   }
> });
>   }
>   // In case of HTTPS, set the client up for ignoring problems with 
> self-signed
>   // certificates and stuff
>   if ("https".equals(scheme)) {
> try {
>   SSLContextBuilder sslBuilder = SSLContexts.custom();
>   sslBuilder.loadTrustMaterial(null, new TrustSelfSignedStrategy());
>   final SSLContext sslContext = sslBuilder.build();
>   restClientBuilder.setHttpClientConfigCallback(new 
> HttpClientConfigCallback() {
> @Override
> public HttpAsyncClientBuilder 
> customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) {
>   // ignore issues with self-signed certificates
>   
> httpClientBuilder.setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE);
>   return httpClientBuilder.setSSLContext(sslContext);
> }
>   });
> } catch (Exception e) {
>   LOG.error("Error setting up SSLContext because: " + e.getMessage(), 
> e);
> }
>   }
> {code}
> On NUTCH-2920, I fixed this for the opensearch-indexer by adding another {{if 
> (auth)}} statement under the {{https}} branch.
> If this is an actual issue, I'm happy to open a PR.  If I've misunderstood 
> the code or the design, please close as "not a problem".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (NUTCH-2989) Can't have username/pw AND https on elastic-indexer?!

2023-08-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned NUTCH-2989:
--

Assignee: Tim Allison

> Can't have username/pw AND https on elastic-indexer?!
> -
>
> Key: NUTCH-2989
> URL: https://issues.apache.org/jira/browse/NUTCH-2989
> Project: Nutch
>  Issue Type: Task
>  Components: indexer, plugin
>Affects Versions: 1.19
>    Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 1.20
>
>
> While working on NUTCH-2920, I copied+pasted the elastic indexer.  As part of 
> that process, I noticed that basic auth doesn't work with https.
> {code:java}
> if (auth) {
> restClientBuilder
> .setHttpClientConfigCallback(new HttpClientConfigCallback() {
>   @Override
>   public HttpAsyncClientBuilder customizeHttpClient(
>   HttpAsyncClientBuilder arg0) {
> return 
> arg0.setDefaultCredentialsProvider(credentialsProvider);
>   }
> });
>   }
>   // In case of HTTPS, set the client up for ignoring problems with 
> self-signed
>   // certificates and stuff
>   if ("https".equals(scheme)) {
> try {
>   SSLContextBuilder sslBuilder = SSLContexts.custom();
>   sslBuilder.loadTrustMaterial(null, new TrustSelfSignedStrategy());
>   final SSLContext sslContext = sslBuilder.build();
>   restClientBuilder.setHttpClientConfigCallback(new 
> HttpClientConfigCallback() {
> @Override
> public HttpAsyncClientBuilder 
> customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) {
>   // ignore issues with self-signed certificates
>   
> httpClientBuilder.setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE);
>   return httpClientBuilder.setSSLContext(sslContext);
> }
>   });
> } catch (Exception e) {
>   LOG.error("Error setting up SSLContext because: " + e.getMessage(), 
> e);
> }
>   }
> {code}
> On NUTCH-2920, I fixed this for the opensearch-indexer by adding another {{if 
> (auth)}} statement under the {{https}} branch.
> If this is an actual issue, I'm happy to open a PR.  If I've misunderstood 
> the code or the design, please close as "not a problem".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Tim Allison
Thank you, all!  I’m thrilled to join the team!

On Thu, Jul 20, 2023 at 9:42 AM Julien Nioche 
wrote:

> What a fantastic addition to the Nutch team! Congrats to Tim
>
> On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel  wrote:
>
>> Dear all,
>>
>> It is my pleasure to announce that Tim Allison has joined us
>> as a committer and member of the Nutch PMC.
>>
>> You may already know Tim as a maintainer of and contributor to
>> Apache Tika. So, it was great to see contributions to the
>> Nutch source code from an experienced developer who is also
>> active in a related Apache project. Among other contributions
>> Tim recently implemented the indexer-opensearch plugin.
>>
>> Thank you, Tim Allison, and congratulations on your new role
>> in the Apache Nutch community! And welcome on board!
>>
>> Sebastian
>> (on behalf of the Nutch PMC)
>
>
>>
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble <http://twitter.com/digitalpebble>
>


[jira] [Updated] (NUTCH-2994) Implement an indexer for OpenSearch 2.x

2023-06-08 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-2994:
---
Description: 
Over on NUTCH-2920, we added an indexer for OpenSearch 1.x.  We should do this 
for 2.x.  This is blocked by: 
https://github.com/opensearch-project/opensearch-java/issues/181 

We could reinvent BulkProcessor/BulkIngester ourselves, but we really, really 
shouldn't do that.

  was:
Over on NUTCH-2920, we added an indexer for OpenSearch 1.x.  We should do this 
for 2.x.  The current blocker is: 
https://github.com/opensearch-project/opensearch-java/issues/181 

We could reinvent BulkProcessor/BulkIngester ourselves, but we really, really 
shouldn't do that.


> Implement an indexer for OpenSearch 2.x
> ---
>
> Key: NUTCH-2994
> URL: https://issues.apache.org/jira/browse/NUTCH-2994
> Project: Nutch
>  Issue Type: Improvement
>        Reporter: Tim Allison
>Priority: Major
>
> Over on NUTCH-2920, we added an indexer for OpenSearch 1.x.  We should do 
> this for 2.x.  This is blocked by: 
> https://github.com/opensearch-project/opensearch-java/issues/181 
> We could reinvent BulkProcessor/BulkIngester ourselves, but we really, really 
> shouldn't do that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2994) Implement an indexer for OpenSearch 2.x

2023-06-08 Thread Tim Allison (Jira)
Tim Allison created NUTCH-2994:
--

 Summary: Implement an indexer for OpenSearch 2.x
 Key: NUTCH-2994
 URL: https://issues.apache.org/jira/browse/NUTCH-2994
 Project: Nutch
  Issue Type: Improvement
Reporter: Tim Allison


Over on NUTCH-2920, we added an indexer for OpenSearch 1.x.  We should do this 
for 2.x.  The current blocker is: 
https://github.com/opensearch-project/opensearch-java/issues/181 

We could reinvent BulkProcessor/BulkIngester ourselves, but we really, really 
shouldn't do that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725842#comment-17725842
 ] 

Tim Allison commented on NUTCH-2959:


tika-server would be cleaner?  Could have autoscaling pods of tika-servers?

> Upgrade to Apache Tika 2.4.1
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725807#comment-17725807
 ] 

Tim Allison commented on NUTCH-2959:


Separately, I'm wondering if it would be useful to add an alternative Tika 
parser that relies on tika-server or a modified version of a pipes-parser.  
This would put all of the Tika dependencies and jar hell in its own process, 
and we wouldn't have to load any dependencies aside from tika-core into Nutch's 
jvm.

 

They're working on doing this over on Solr now as well (I think they've chosen 
the tika-server route).

> Upgrade to Apache Tika 2.4.1
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725805#comment-17725805
 ] 

Tim Allison commented on NUTCH-2959:


I just opened a PR to upgrade Tika to 2.8.0 on ANY23: 
https://issues.apache.org/jira/browse/ANY23-610 -> 
[https://github.com/apache/any23/pull/320] 

Let's see if we can get buy-in and maybe another release of any23?

> Upgrade to Apache Tika 2.4.1
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2989) Can't have username/pw AND https on elastic-indexer?!

2023-03-01 Thread Tim Allison (Jira)
Tim Allison created NUTCH-2989:
--

 Summary: Can't have username/pw AND https on elastic-indexer?!
 Key: NUTCH-2989
 URL: https://issues.apache.org/jira/browse/NUTCH-2989
 Project: Nutch
  Issue Type: Task
Reporter: Tim Allison


While working on NUTCH-2920, I copied+pasted the elastic indexer.  As part of 
that process, I noticed that basic auth doesn't work with https.


{code:java}
if (auth) {
  restClientBuilder
      .setHttpClientConfigCallback(new HttpClientConfigCallback() {
        @Override
        public HttpAsyncClientBuilder customizeHttpClient(
            HttpAsyncClientBuilder arg0) {
          return arg0.setDefaultCredentialsProvider(credentialsProvider);
        }
      });
}

// In case of HTTPS, set the client up for ignoring problems with
// self-signed certificates and stuff
if ("https".equals(scheme)) {
  try {
    SSLContextBuilder sslBuilder = SSLContexts.custom();
    sslBuilder.loadTrustMaterial(null, new TrustSelfSignedStrategy());
    final SSLContext sslContext = sslBuilder.build();

    restClientBuilder.setHttpClientConfigCallback(
        new HttpClientConfigCallback() {
          @Override
          public HttpAsyncClientBuilder customizeHttpClient(
              HttpAsyncClientBuilder httpClientBuilder) {
            // ignore issues with self-signed certificates
            httpClientBuilder
                .setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE);
            return httpClientBuilder.setSSLContext(sslContext);
          }
        });
  } catch (Exception e) {
    LOG.error("Error setting up SSLContext because: " + e.getMessage(), e);
  }
}
{code}

On NUTCH-2920, I fixed this for the opensearch-indexer by adding another {{if 
(auth)}} statement under the {{https}} branch.

If this is an actual issue, I'm happy to open a PR.  If I've misunderstood the 
code or the design, please close as "not a problem".
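If the root cause is what it looks like (the builder keeps a single HttpClientConfigCallback, so the second setHttpClientConfigCallback call in the https branch silently replaces the credentials one), the overwrite can be reproduced without the Elasticsearch client at all. A minimal stdlib-only sketch of the pattern and the fix; ToyRestClientBuilder and the string tags are illustrative stand-ins, not the real client API:

```java
import java.util.function.UnaryOperator;

public class CallbackOverwriteSketch {

  /** Toy stand-in for RestClientBuilder: holds exactly one config callback. */
  static class ToyRestClientBuilder {
    UnaryOperator<String> httpClientConfigCallback = UnaryOperator.identity();

    ToyRestClientBuilder setHttpClientConfigCallback(UnaryOperator<String> cb) {
      this.httpClientConfigCallback = cb; // overwrites any previous callback
      return this;
    }

    String build() {
      return httpClientConfigCallback.apply("client");
    }
  }

  /** Mirrors the buggy shape: two independent set calls; the second wins. */
  static String configure(boolean auth, boolean https) {
    ToyRestClientBuilder builder = new ToyRestClientBuilder();
    if (auth) {
      builder.setHttpClientConfigCallback(c -> c + "+credentials");
    }
    if (https) {
      // this second call discards the credentials callback
      builder.setHttpClientConfigCallback(c -> c + "+sslContext");
    }
    return builder.build();
  }

  /** Mirrors the fix: one callback that applies both concerns. */
  static String configureFixed(boolean auth, boolean https) {
    ToyRestClientBuilder builder = new ToyRestClientBuilder();
    builder.setHttpClientConfigCallback(c -> {
      if (auth) c = c + "+credentials";
      if (https) c = c + "+sslContext";
      return c;
    });
    return builder.build();
  }

  public static void main(String[] args) {
    System.out.println(configure(true, true));      // auth is lost
    System.out.println(configureFixed(true, true)); // both settings applied
  }
}
```

The NUTCH-2920 fix described above corresponds to the configureFixed shape: a single callback that sets both the credentials provider and the SSL context.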




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-03-01 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-2988.

Resolution: Duplicate

Duplicate.  Sorry!

> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Attachments: LICENSE.txt
>
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism.  Or, 
> the last purely ASL 2.0 license was in 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client as opposed to the main search project still actually ASL 
> 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (NUTCH-2927) indexer-elastic: use Java API client

2023-03-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695217#comment-17695217
 ] 

Tim Allison edited comment on NUTCH-2927 at 3/1/23 5:26 PM:


Over on NUTCH-2920 , I stumbled into the blocker that [BulkProcessor doesn't 
yet exist for this 
client|https://issues.apache.org/jira/browse/NUTCH-2920?focusedCommentId=17695148=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17695148]
 in OpenSearch.  This is also the case for Elasticsearch: 
https://github.com/elastic/elasticsearch-java/issues/108

See the link on NUTCH-2920 for why this is important.  It is. 


was (Author: talli...@mitre.org):
Over on NUTCH-2920 , I stumbled into the blocker that [BulkProcessor doesn't 
yet exist for this client in 
OpenSearch|https://issues.apache.org/jira/browse/NUTCH-2920?focusedCommentId=17695148=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17695148].
  This is also the case for Elasticsearch: 
https://github.com/elastic/elasticsearch-java/issues/108

See the link on NUTCH-2920 for why this is important.  It is. 

> indexer-elastic: use Java API client
> 
>
> Key: NUTCH-2927
> URL: https://issues.apache.org/jira/browse/NUTCH-2927
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.20
>
>
> See Lewis comment in [PR 
> #713|https://github.com/apache/nutch/pull/703#issuecomment-1008159052] 
> (NUTCH-2903): "High Level REST Client was deprecated in ES 7.15.0 in favor of 
> the [Java API 
> Client|https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/current/index.html];



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2927) indexer-elastic: use Java API client

2023-03-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695217#comment-17695217
 ] 

Tim Allison commented on NUTCH-2927:


Over on NUTCH-2920 , I stumbled into the blocker that [BulkProcessor doesn't 
yet exist for this client in 
OpenSearch|https://issues.apache.org/jira/browse/NUTCH-2920?focusedCommentId=17695148=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17695148].
  This is also the case for Elasticsearch: 
https://github.com/elastic/elasticsearch-java/issues/108

See the link on NUTCH-2920 for why this is important.  It is. 

> indexer-elastic: use Java API client
> 
>
> Key: NUTCH-2927
> URL: https://issues.apache.org/jira/browse/NUTCH-2927
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.20
>
>
> See Lewis comment in [PR 
> #713|https://github.com/apache/nutch/pull/703#issuecomment-1008159052] 
> (NUTCH-2903): "High Level REST Client was deprecated in ES 7.15.0 in favor of 
> the [Java API 
> Client|https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/current/index.html];



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2920) Implement an indexer-opensearch plugin

2023-03-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695152#comment-17695152
 ] 

Tim Allison commented on NUTCH-2920:


Current proposal is to go with the high level rest client for 1.x for now and 
cheer on the successful completion of 
https://github.com/opensearch-project/opensearch-java/issues/181.

> Implement an indexer-opensearch plugin
> -
>
> Key: NUTCH-2920
> URL: https://issues.apache.org/jira/browse/NUTCH-2920
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We will be moving to AWS-managed OpenSearch in the near term and I would like 
> to index our content there.
> As of writing the OpenSearch project has published two plugin versions under 
> the Apache License v2 so far
> https://github.com/opensearch-project/opensearch-java/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2920) Implement an indexer-opensearch plugin

2023-03-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695148#comment-17695148
 ] 

Tim Allison commented on NUTCH-2920:


Well, that was a funny notion...

Turns out there is no BulkProcessor currently in the regular java-client (only 
exists in the high level java client) -- 
https://github.com/opensearch-project/opensearch-java/issues/181

So, we can make bulk requests with the basic java client, but we'd have to 
cache the bulk operations and have logic for when to run the operations.

The BulkProcessor takes care of all of this and has triggers for when to send 
the bulk data (size or time) and has retry logic and some other useful things.

This means that we'd have to reimplement that functionality, which I did on 
Tika ... and I don't want to do again. LOL...
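For readers unfamiliar with what BulkProcessor provides, here is a stripped-down, stdlib-only sketch of the flush-on-size-or-time batching described above (retry/backoff omitted; the BulkBatcher name and String actions are illustrative, not the OpenSearch API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BulkBatcherSketch {

  /** Buffers actions and flushes when either a size or an age threshold is hit. */
  static class BulkBatcher {
    private final int maxActions;
    private final long maxAgeMillis;
    private final Consumer<List<String>> flushHandler;
    private final List<String> buffer = new ArrayList<>();
    private long firstActionAt = -1;

    BulkBatcher(int maxActions, long maxAgeMillis,
        Consumer<List<String>> flushHandler) {
      this.maxActions = maxActions;
      this.maxAgeMillis = maxAgeMillis;
      this.flushHandler = flushHandler;
    }

    /** Adds one action; nowMillis is passed in to keep the sketch testable. */
    void add(String action, long nowMillis) {
      if (buffer.isEmpty()) {
        firstActionAt = nowMillis;
      }
      buffer.add(action);
      if (buffer.size() >= maxActions
          || nowMillis - firstActionAt >= maxAgeMillis) {
        flush();
      }
    }

    /** Sends whatever is buffered as one bulk request and resets the buffer. */
    void flush() {
      if (buffer.isEmpty()) {
        return;
      }
      flushHandler.accept(new ArrayList<>(buffer));
      buffer.clear();
      firstActionAt = -1;
    }
  }

  public static void main(String[] args) {
    List<List<String>> batches = new ArrayList<>();
    BulkBatcher batcher = new BulkBatcher(3, 1000, batches::add);
    batcher.add("doc1", 0);
    batcher.add("doc2", 10);
    batcher.add("doc3", 20);   // size trigger: flushes [doc1, doc2, doc3]
    batcher.add("doc4", 30);
    batcher.add("doc5", 2000); // time trigger: flushes [doc4, doc5]
    batcher.flush();           // nothing left to flush
    System.out.println(batches.size());
  }
}
```

The real BulkProcessor additionally handles concurrency, retries with backoff, and a background timer for the age trigger, which is exactly the functionality nobody wants to reimplement per project.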

> Implement an indexer-opensearch plugin
> -
>
> Key: NUTCH-2920
> URL: https://issues.apache.org/jira/browse/NUTCH-2920
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We will be moving to AWS-managed OpenSearch in the near term and I would like 
> to index our content there.
> As of writing the OpenSearch project has published two plugin versions under 
> the Apache License v2 so far
> https://github.com/opensearch-project/opensearch-java/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2920) Implement an indexer-opensearch plugin

2023-03-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695096#comment-17695096
 ] 

Tim Allison commented on NUTCH-2920:


My initial PR was a simple copy+paste, with a few modifications, of the 
ElasticsearchIndexWriter.  That was partly to make review easier, and partly 
because the lower-level Java REST client was in beta and OpenSearch was still 
recommending the high-level REST client 
(https://opensearch.org/docs/1.2/clients/java/). 

In thinking about this more, I realize that this "beta" message was for 1.2.  
It is gone in 1.3 (https://opensearch.org/docs/1.3/clients/java/). Further, the 
high level rest client is deprecated in 2.x and will be removed in 3.x.

I'm going to rework the PR to use the more modern client.  This will make 
migrating to 2.x easier and hopefully require far fewer dependencies in 1.x?

> Implement an indexer-opensearch plugin
> -
>
> Key: NUTCH-2920
> URL: https://issues.apache.org/jira/browse/NUTCH-2920
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We will be moving to AWS-managed OpenSearch in the near term and I would like 
> to index our content there.
> As of writing the OpenSearch project has published two plugin versions under 
> the Apache License v2 so far
> https://github.com/opensearch-project/opensearch-java/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694744#comment-17694744
 ] 

Tim Allison commented on NUTCH-2988:


If you open the 7.13.2 jar file, there are just the two: "Server Side Public 
License" and "Elastic License 2.0".  The word "Apache" never appears.

> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Attachments: LICENSE.txt
>
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism.  Or, 
> the last purely ASL 2.0 license was in 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client as opposed to the main search project still actually ASL 
> 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-2988:
---
Attachment: LICENSE.txt

> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
> Attachments: LICENSE.txt
>
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism.  Or, 
> the last purely ASL 2.0 license was in 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client as opposed to the main search project still actually ASL 
> 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-2988:
---
Description: 
In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
high level java client is at 7.13.2, which is after the great schism.  Or, the 
last purely ASL 2.0 license was in 7.10.2.

So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing plan 
suitable to be released within an ASF project?

Or, is the client as opposed to the main search project still actually ASL 2.0?

Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt

  was:
In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
high level java client is at 7.13.2, which is after the great schism.  Or, the 
last purely ASL 2.0 license was in 7.10.2.

So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing plan 
suitable to be released within an ASF project?

Or, is the client still actually ASL 2.0?

Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt


> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism.  Or, 
> the last purely ASL 2.0 license was in 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client as opposed to the main search project still actually ASL 
> 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694739#comment-17694739
 ] 

Tim Allison commented on NUTCH-2988:


Yes, OK. 
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_license.html.

Maven Central, which raised my initial concern, is probably not the best 
resource: 
https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client/7.13.2
 and 

https://raw.githubusercontent.com/elastic/elasticsearch/v7.13.2/licenses/ELASTIC-LICENSE-2.0.txt

> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism.  Or, 
> the last purely ASL 2.0 license was in 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client still actually ASL 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-2988:
---
Description: 
In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
high level java client is at 7.13.2, which is after the great schism.  Or, the 
last purely ASL 2.0 license was in 7.10.2.

So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing plan 
suitable to be released within an ASF project?

Or, is the client still actually ASL 2.0?

Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt

  was:
In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
high level java client is at 7.13.2, which is after the great schism.  Or, the 
last purely ASL 2.0 license was in 7.10.2.

So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing plan 
suitable to be released within an ASF project?

Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt


> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
>  Issue Type: Task
>        Reporter: Tim Allison
>Priority: Minor
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism.  Or, 
> the last purely ASL 2.0 license was in 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client still actually ASL 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Tim Allison (Jira)
Tim Allison created NUTCH-2988:
--

 Summary: Elasticsearch 7.13.2 compatible with ASL 2.0?
 Key: NUTCH-2988
 URL: https://issues.apache.org/jira/browse/NUTCH-2988
 Project: Nutch
  Issue Type: Task
Reporter: Tim Allison


In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
high level java client is at 7.13.2, which is after the great schism.  Or, the 
last purely ASL 2.0 license was in 7.10.2.

So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing plan 
suitable to be released within an ASF project?

Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939516#comment-16939516
 ] 

Tim Allison edited comment on NUTCH-2457 at 9/27/19 2:55 PM:
-

W00t!  Default is to parse embedded, right? :D

Wouldn't want to break backwards compatibility!  

Kidding...I'm kidding...

Sorry, and thank you!


was (Author: talli...@mitre.org):
W00t!  Default is to parse embedded, right? :D

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.14
>    Reporter: Tim Allison
>Priority: Major
>  Labels: patch-available
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939516#comment-16939516
 ] 

Tim Allison commented on NUTCH-2457:


W00t!  Default is to parse embedded, right? :D

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.14
>    Reporter: Tim Allison
>Priority: Major
>  Labels: patch-available
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939478#comment-16939478
 ] 

Tim Allison commented on NUTCH-2457:


The issue is that the AutoDetectParser automatically/silently adds itself as a 
parser to the ParseContext.  When an embedded document is parsed, there's a 
lookup for the embedded parser in the ParseContext.  Because you weren't using 
the AutoDetectParser, there is no parser in ParseContext, and the embedded 
documents are not being parsed.

So, you have two options (maybe more...):

1) use the AutoDetectParser; set 
https://tika.apache.org/1.17/api/org/apache/tika/metadata/TikaCoreProperties.html#CONTENT_TYPE_OVERRIDE
 to the mime, and you'll avoid a second detection for the container file

2) Use your current method, but add a cached AutoDetectParser to the 
ParseContext

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>    Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2019-09-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939473#comment-16939473
 ] 

Tim Allison commented on NUTCH-2457:


Let me take a look at the code again...it has been a while...

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>    Reporter: Tim Allison
>Priority: Major
> Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2586) Add a fallback mechanism for missing meta tags

2018-07-13 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542898#comment-16542898
 ] 

Tim Allison commented on NUTCH-2586:


Is this better handled at the Tika level...or is this something we should also 
add to Tika?

> Add a fallback mechanism for missing meta tags
> --
>
> Key: NUTCH-2586
> URL: https://issues.apache.org/jira/browse/NUTCH-2586
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Gerard Bouchar
>Priority: Major
>
> While using nutch, we faced the following issue: some web pages miss a 
> "description"  meta tag, but include an "og:description" meta (using the 
> [open graph protocol|http://ogp.me/]).
> Here are two examples: 
> * 
> http://imagenesdelavirgenmaria.com/17-imagenes-de-la-virgen-maria-de-guadalupe/
> * 
> http://mixcdsource.com/product/dj-arson-dj-sin-cerothe-hit-list-18-5-reggaeton-edition/
> It would be nice to have a configurable list of fallback meta tags to use 
> when the main meta tag is absent. Something that would allow us to specify, 
> in the configuration, "when the 'description' meta is missing, use 
> 'og:description', when 'title' is missing, use 'og:title', etc..." .
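The fallback lookup requested above could work roughly as sketched below. The `MetaFallback` class and its hard-coded map are hypothetical stand-ins for the proposed configuration; only the lookup order reflects the request:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaFallback {

  // Stand-in for the proposed configuration: primary tag -> fallback tag.
  private final Map<String, String> fallbacks = new LinkedHashMap<>();

  public MetaFallback() {
    fallbacks.put("description", "og:description");
    fallbacks.put("title", "og:title");
  }

  /** Return meta[name]; if absent or empty, try the configured fallback tag. */
  public String get(Map<String, String> meta, String name) {
    String value = meta.get(name);
    if (value == null || value.isEmpty()) {
      String alt = fallbacks.get(name);
      if (alt != null) {
        value = meta.get(alt);
      }
    }
    return value;
  }
}
```

In Nutch this mapping would presumably come from the configuration rather than being hard-coded, so sites exposing only Open Graph tags still yield a description and title.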



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482879#comment-16482879
 ] 

Tim Allison edited comment on NUTCH-2578 at 5/21/18 6:38 PM:
-

Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see [this 
comment|https://issues.apache.org/jira/browse/TIKA-2645?focusedCommentId=16482862&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16482862]

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...or if you can recommend a way for 
us to autosize, that'd be even better.
{quote}
 

If you are able to grab a nightly build and test on your machines/framework, 
let us know if you find any surprises!


was (Author: talli...@mitre.org):
Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see [this 
comment|https://issues.apache.org/jira/browse/TIKA-2645?focusedCommentId=16482862&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16482862]

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...of if you can recommend a way for 
us to autosize, that'd be even better.
{quote}
 

If you are able to grab a nightly build and test on your machines/framework, 
let us know if you find any surprises!
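The pool-sizing call referenced in the comment is a one-liner; here is a minimal sketch assuming Tika's `org.apache.tika.utils.XMLReaderUtils` API. The `TikaPoolSetup` wrapper and the `numThreads` parameter are invented for illustration (in Nutch, `numThreads` would come from the fetcher's configured thread count):

```java
import org.apache.tika.exception.TikaException;
import org.apache.tika.utils.XMLReaderUtils;

public class TikaPoolSetup {

  /** Size the shared SAXParser pool before starting worker threads. */
  public static void configure(int numThreads) throws TikaException {
    // One pooled SAXParser per concurrent parsing thread avoids the
    // contention on XML root detection discussed above.
    XMLReaderUtils.setPoolSize(numThreads);
  }
}
```

Called once at startup, e.g. `TikaPoolSetup.configure(fetcherThreads);`, before the fetcher threads begin detecting content.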

> Avoid lock by MimeUtil in constructor of protocol.Content
> -
>
> Key: NUTCH-2578
> URL: https://issues.apache.org/jira/browse/NUTCH-2578
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new 
> MimeUtil object. That's not cheap as it always creates a new Tika object and 
> there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x7f70523c3800 
> nid=0x1de2 waiting for monitor entry [0x7f70193a8000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
> - waiting to lock <0x0005e0285758> (a java.util.jar.JarFile)
> at java.util.jar.JarFile.getEntry(JarFile.java:240)
> at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
> at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
> at 
> sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
> at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
> at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
> at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
> at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
> at 
> sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
> at java.util.Collections.list(Collections.java:5239)
> at 
> org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
> at 
> org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
> at 
> org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
> at 
> org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
> at 
> org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
> at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
> at 
> org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
> at org.apache.tika.Tika.<init>(Tika.java:116)
> at org.apache.nutch.util.MimeUtil.<init>(

[jira] [Comment Edited] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482879#comment-16482879
 ] 

Tim Allison edited comment on NUTCH-2578 at 5/21/18 6:37 PM:
-

Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see [this 
comment|https://issues.apache.org/jira/browse/TIKA-2645?focusedCommentId=16482862&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16482862]

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...of if you can recommend a way for 
us to autosize, that'd be even better.
{quote}
 

If you are able to grab a nightly build and test on your machines/framework, 
let us know if you find any surprises!


was (Author: talli...@mitre.org):
Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see this comment.

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...of if you can recommend a way for 
us to autosize, that'd be even better.
{quote}
 

If you are able to grab a nightly build and test on your machines/framework, 
let us know if you find any surprises!

> Avoid lock by MimeUtil in constructor of protocol.Content
> -
>
> Key: NUTCH-2578
> URL: https://issues.apache.org/jira/browse/NUTCH-2578
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new 
> MimeUtil object. That's not cheap as it always creates a new Tika object and 
> there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x7f70523c3800 
> nid=0x1de2 waiting for monitor entry [0x7f70193a8000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
> - waiting to lock <0x0005e0285758> (a java.util.jar.JarFile)
> at java.util.jar.JarFile.getEntry(JarFile.java:240)
> at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
> at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
> at 
> sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
> at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
> at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
> at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
> at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
> at 
> sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
> at java.util.Collections.list(Collections.java:5239)
> at 
> org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
> at 
> org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
> at 
> org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
> at 
> org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
> at 
> org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
> at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
> at 
> org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
> at org.apache.tika.Tika.<init>(Tika.java:116)
> at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
> at org.apache.nutch.protocol.Content.<init>(Content.java:83)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
>

[jira] [Commented] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482879#comment-16482879
 ] 

Tim Allison commented on NUTCH-2578:


Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see this comment.

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...of if you can recommend a way for 
us to autosize, that'd be even better.
{quote}

> Avoid lock by MimeUtil in constructor of protocol.Content
> -
>
> Key: NUTCH-2578
> URL: https://issues.apache.org/jira/browse/NUTCH-2578
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new 
> MimeUtil object. That's not cheap as it always creates a new Tika object and 
> there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x7f70523c3800 
> nid=0x1de2 waiting for monitor entry [0x7f70193a8000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
> - waiting to lock <0x0005e0285758> (a java.util.jar.JarFile)
> at java.util.jar.JarFile.getEntry(JarFile.java:240)
> at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
> at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
> at 
> sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
> at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
> at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
> at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
> at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
> at 
> sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
> at java.util.Collections.list(Collections.java:5239)
> at 
> org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
> at 
> org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
> at 
> org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
> at 
> org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
> at 
> org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
> at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
> at 
> org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
> at org.apache.tika.Tika.<init>(Tika.java:116)
> at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
> at org.apache.nutch.protocol.Content.<init>(Content.java:83)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads this may cause a significant bottleneck, 
> running a Fetcher with 120 threads I've found up to 50 threads waiting for 
> this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
>   | grep -A25 'waiting to lock' \
>   | grep -F 'org.apache.tika.Tika.' \
>   | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe [including the called Tika 
> detector|https://www.mail-archive.com/user@tika.apache.org/msg00296.html], 
> the best solution seems to cache the MimeUtil object in the actual protocol 
> implementation as it is done in Nutch 2.x ([lib-http HttpBase, line 
> #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).
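The caching recommended in the description can be illustrated with a self-contained stand-in. `ExpensiveDetector` below is a hypothetical substitute for `MimeUtil`/`Tika` (so the sketch compiles without Nutch on the classpath), and the initialization-on-demand holder shows the construct-once idiom behind the once-per-protocol caching described above:

```java
public class DetectorCache {

  public static int constructions = 0;  // cost counter, for the sketch only

  public static class ExpensiveDetector {
    public ExpensiveDetector() {
      constructions++;  // imagine TikaConfig creation and jar scanning here
    }

    public String detect(String url) {
      return url.endsWith(".html") ? "text/html" : "application/octet-stream";
    }
  }

  // Initialization-on-demand holder: the detector is built exactly once,
  // on first use, and safely published to all fetcher threads.
  private static class Holder {
    static final ExpensiveDetector INSTANCE = new ExpensiveDetector();
  }

  public static ExpensiveDetector get() {
    return Holder.INSTANCE;
  }
}
```

Every `Content` constructor would then call `DetectorCache.get()` (or, as in Nutch 2.x, read a field initialized once in the protocol's `setConf`) instead of building a fresh detector and contending on the jar-file lock.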



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2017-11-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1622#comment-1622
 ] 

Tim Allison commented on NUTCH-2457:


Before Tika 1.15 (I think...might have been 1.16?), you'd have to put the 
AutoDetectParser in the ParseContext to parse embedded documents, see: 
SOLR-7189.  However, you don't have to do that any more...

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>        Reporter: Tim Allison
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2017-11-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1620#comment-1620
 ] 

Tim Allison edited comment on NUTCH-2457 at 11/16/17 4:22 PM:
--

So, in lieu of a PR...please, please, please use the AutoDetectParser, like so:
{noformat}
+Parser parser = new AutoDetectParser(tikaConfig);
-Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));
{noformat}

Your current method won't parse embedded documents/attachments.

Unit test: test that the extracted string contains "When in the Course of human 
events" on the file that I linked in the description.




was (Author: talli...@mitre.org):
So, in lieu of a PR...please, please, please use the AutoDetectParser, like so:
{noformat}
+Parser p = new AutoDetectParser(tikaConfig);
-Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));
{noformat}

Your current method won't parse embedded documents/attachments.

Unit test: test that the extracted string contains "When in the Course of human 
events" on the file that I linked in the description.



> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Reporter: Tim Allison
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2017-11-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1620#comment-1620
 ] 

Tim Allison commented on NUTCH-2457:


So, in lieu of a PR...please, please, please use the AutoDetectParser, like so:
{noformat}
+Parser p = new AutoDetectParser(tikaConfig);
-Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));
{noformat}

Your current method won't parse embedded documents/attachments.

Unit test: test that the extracted string contains "When in the Course of human 
events" on the file that I linked in the description.
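The suggested swap can be sketched end to end as below, combined with the `CONTENT_TYPE_OVERRIDE` hint from the later discussion so the container type detected by Nutch is not re-detected by Tika. The `AutoDetectParse` class and its wiring are illustrative assumptions on Tika 1.x APIs, not the eventual Nutch patch:

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;

public class AutoDetectParse {

  public static String parse(InputStream in, String mimeType) throws Exception {
    // AutoDetectParser registers itself in the ParseContext, so embedded
    // documents and attachments are parsed as well.
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Pass along the type Nutch already detected to skip a second detection
    // pass on the container file.
    metadata.set(TikaCoreProperties.CONTENT_TYPE_OVERRIDE, mimeType);
    BodyContentHandler handler = new BodyContentHandler(-1);  // no write limit
    parser.parse(in, handler, metadata, new ParseContext());
    return handler.toString();
  }
}
```

A unit test along the lines suggested above would assert that the string returned for the linked .docx contains "When in the Course of human events".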



> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>Reporter: Tim Allison
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2017-11-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255540#comment-16255540
 ] 

Tim Allison commented on NUTCH-2457:


I'm sure this is user error, and I need to put something else on my path, but 
I'm still getting this:

{noformat}
 java.lang.RuntimeException: x-point org.apache.nutch.protocol.Protocol not 
found.
at 
org.apache.nutch.protocol.ProtocolFactory.<init>(ProtocolFactory.java:53)
at 
org.apache.nutch.tika.TestMSWordParser.getTextContent(TestMSWordParser.java:66)

{noformat}

> Embedded documents likely not correctly parsed by Tika
> --
>
> Key: NUTCH-2457
> URL: https://issues.apache.org/jira/browse/NUTCH-2457
> Project: Nutch
>  Issue Type: Bug
>        Reporter: Tim Allison
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

