[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3056:
-
Description: 
We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect through several hops to elsewhere, the protocol 
may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector. Seeds that do not 
lead to a URL returning HTTP 200 will be discarded. Enabling filtering and 
normalization is highly recommended for handling the redirects.
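
As a rough illustration of the resolving step described above (a sketch under 
assumptions, not the code from this ticket): each seed is probed, redirects are 
followed up to a limit, and only seeds that end at an HTTP 200 are kept. The 
class SeedResolverSketch, the Probe type and the injected prober function are 
hypothetical names introduced here. A real resolver would issue network 
requests and pass redirect targets through the URL filters and normalizers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of seed resolving; not the NUTCH-3056 implementation. */
final class SeedResolverSketch {

  /** Outcome of probing one URL: HTTP status plus redirect target (null if none). */
  static final class Probe {
    final int status;
    final String location;

    Probe(int status, String location) {
      this.status = status;
      this.location = location;
    }
  }

  /** Follows at most maxRedirects redirects; empty unless the chain ends at HTTP 200. */
  static Optional<String> resolve(String seed, Function<String, Probe> prober,
      int maxRedirects) {
    String url = seed;
    for (int hop = 0; hop <= maxRedirects; hop++) {
      Probe probe = prober.apply(url);
      if (probe.status == 200) {
        return Optional.of(url);
      }
      boolean redirect = probe.status >= 300 && probe.status < 400
          && probe.location != null;
      if (!redirect) {
        return Optional.empty(); // dead host, 4xx/5xx, unknown status
      }
      url = probe.location;
    }
    return Optional.empty(); // redirect chain longer than allowed
  }

  /** Resolves all seeds on a fixed-size thread pool, keeping input order. */
  static List<String> resolveAll(List<String> seeds, Function<String, Probe> prober,
      int maxRedirects, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Optional<String>>> futures = new ArrayList<>();
      for (String seed : seeds) {
        futures.add(pool.submit(() -> resolve(seed, prober, maxRedirects)));
      }
      List<String> kept = new ArrayList<>();
      for (Future<Optional<String>> future : futures) {
        future.get().ifPresent(kept::add);
      }
      return kept;
    } catch (Exception e) {
      throw new IllegalStateException("resolving seeds failed", e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // Fake prober standing in for real HTTP requests (assumption for the demo).
    Map<String, Probe> fake = new HashMap<>();
    fake.put("http://a.example/", new Probe(301, "https://a.example/"));
    fake.put("https://a.example/", new Probe(200, null));
    fake.put("http://gone.example/", new Probe(404, null));
    Function<String, Probe> prober = u -> fake.getOrDefault(u, new Probe(0, null));

    System.out.println(resolveAll(
        List.of("http://a.example/", "http://gone.example/"), prober, 5, 4));
  }
}
```

Injecting the prober keeps the redirect logic testable without network access. 
The thread pool only matters once the prober performs real I/O.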

If you have a seed file with 10k+ or millions of records, it is highly 
recommended to split the input file into chunks so that multiple mappers can 
get to work. Passing a few million records through one mapper without resolving 
is no problem, but resolving millions with one mapper, even if threaded, will 
take many hours.
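
The chunking advice above could be sketched as follows. This is a minimal 
example, assuming each chunk is afterwards written to its own file in the seed 
directory so that each file can be handled by a separate mapper. 
SeedChunkerSketch and its chunk size are hypothetical, and writing the files is 
left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that cuts a seed list into fixed-size chunks. */
final class SeedChunkerSketch {

  /** Returns consecutive chunks of at most chunkSize seeds, in input order. */
  static List<List<String>> chunk(List<String> seeds, int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    List<List<String>> chunks = new ArrayList<>();
    for (int start = 0; start < seeds.size(); start += chunkSize) {
      int end = Math.min(start + chunkSize, seeds.size());
      // Copy the view so each chunk is independent of the input list.
      chunks.add(new ArrayList<>(seeds.subList(start, end)));
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> seeds = List.of("http://a.example/", "http://b.example/",
        "http://c.example/", "http://d.example/", "http://e.example/");
    // Five seeds with a chunk size of two yield three chunks.
    for (List<String> part : chunk(seeds, 2)) {
      System.out.println(part);
    }
  }
}
```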

  was:
We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect through several hops to elsewhere, the protocol 
may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.

If you have a seed file with 10k+ or millions of records, it is highly 
recommended to split the input file into chunks so that multiple mappers can 
get to work. Passing a few million records through one mapper without resolving 
is no problem, but resolving millions with one mapper, even if threaded, will 
take many hours.


> Injector to support resolving seed URLs
> ---
>
> Key: NUTCH-3056
> URL: https://issues.apache.org/jira/browse/NUTCH-3056
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
>
> We have a case where clients submit huge uncurated seed files: a host may no 
> longer exist or may redirect through several hops to elsewhere, the protocol 
> may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list, 
> except for regex exceptions listed in db-ignore-external-exemptions. To 
> control the size of the crawl, it is also not allowed to jump to other 
> domains/hosts. This means externally redirecting seeds will not be crawled.
> This ticket will add support for a multi-threaded 
> host/domain/protocol/redirecter/resolver to the injector. Seeds that do not 
> lead to a URL returning HTTP 200 will be discarded. Enabling filtering and 
> normalization is highly recommended for handling the redirects.
> If you have a seed file with 10k+ or millions of records, it is highly 
> recommended to split the input file into chunks so that multiple mappers can 
> get to work. Passing a few million records through one mapper without 
> resolving is no problem, but resolving millions with one mapper, even if 
> threaded, will take many hours.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3056:
-
Description: 
We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect through several hops to elsewhere, the protocol 
may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.

If you have a seed file with 10k+ or millions of records, it is highly 
recommended to split the input file into chunks so that multiple mappers can 
get to work. Passing a few million records through one mapper without resolving 
is no problem, but resolving millions with one mapper, even if threaded, will 
take many hours.

  was:
We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect through several hops to elsewhere, the protocol 
may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.


> Injector to support resolving seed URLs
> ---
>
> Key: NUTCH-3056
> URL: https://issues.apache.org/jira/browse/NUTCH-3056
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
>
> We have a case where clients submit huge uncurated seed files: a host may no 
> longer exist or may redirect through several hops to elsewhere, the protocol 
> may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list, 
> except for regex exceptions listed in db-ignore-external-exemptions. To 
> control the size of the crawl, it is also not allowed to jump to other 
> domains/hosts. This means externally redirecting seeds will not be crawled.
> This ticket will add support for a multi-threaded 
> host/domain/protocol/redirecter/resolver to the injector.
> If you have a seed file with 10k+ or millions of records, it is highly 
> recommended to split the input file into chunks so that multiple mappers can 
> get to work. Passing a few million records through one mapper without 
> resolving is no problem, but resolving millions with one mapper, even if 
> threaded, will take many hours.





[jira] [Created] (NUTCH-3056) Injector to support resolving seed URLs

2024-05-16 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3056:


 Summary: Injector to support resolving seed URLs
 Key: NUTCH-3056
 URL: https://issues.apache.org/jira/browse/NUTCH-3056
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.21


We have a case where clients submit huge uncurated seed files: a host may no 
longer exist or may redirect through several hops to elsewhere, the protocol 
may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means externally redirecting seeds will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842384#comment-17842384
 ] 

Markus Jelsma commented on NUTCH-3028:
--

Ok, the Content object is now also available in the evaluation. I added an 
example of it to the description above.

 

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3028-2.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Description: 
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'

or

-expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'

  was:
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'


> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>    Affects Versions: 1.19
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836133#comment-17836133
 ] 

Markus Jelsma commented on NUTCH-3039:
--

Thanks for spotting that!

> Failure to handle ftp:// URLs
> -
>
> Key: NUTCH-3039
> URL: https://issues.apache.org/jira/browse/NUTCH-3039
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>    "ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException
> at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler
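
One generic way to pass ftp:// URLs to the standard JVM handler is for a 
URLStreamHandlerFactory to return null for that protocol, because java.net.URL 
then falls back to its built-in handlers. The sketch below shows only that 
java.net mechanism. It is not the fix applied in Nutch, and the class names and 
the "smb" plugin-only protocol are hypothetical.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

/** Hypothetical sketch of a factory that delegates ftp to the JVM. */
final class FtpFallbackSketch {

  static final class DelegatingFactory implements URLStreamHandlerFactory {
    // Hypothetical protocol served only by a crawler plugin, never by the JVM.
    private final Set<String> pluginOnly = Set.of("smb");

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
      if (pluginOnly.contains(protocol)) {
        // Dummy handler: lets java.net.URL parse the URL but refuses to connect.
        return new URLStreamHandler() {
          @Override
          protected URLConnection openConnection(URL url) throws IOException {
            throw new IOException("handled by a plugin, not by java.net: " + url);
          }
        };
      }
      // Returning null makes java.net.URL fall back to the JVM's own handlers,
      // which include one for ftp.
      return null;
    }
  }

  /** True if java.net.URL accepts the given string. */
  static boolean parses(String spec) {
    try {
      new URL(spec);
      return true;
    } catch (MalformedURLException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The factory can be installed only once per JVM.
    URL.setURLStreamHandlerFactory(new DelegatingFactory());
    System.out.println(parses("ftp://ftp.example.com/path/file.txt"));
    System.out.println(parses("smb://fileserver.example.com/share/file.txt"));
  }
}
```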





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827048#comment-17827048
 ] 

Markus Jelsma commented on NUTCH-3029:
--

comment describing throws is also required these days.

   a8ec17ca8..98902236d  master -> master

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).
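
The idea of host-specific bounds can be sketched as a clamp on the interval 
the scheduler proposes. This is an illustration under assumptions, not the 
attached patch: HostIntervalSketch, its methods and the example values are 
hypothetical, and the format of the .txt configuration file is not modelled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-host min/max refetch intervals (in seconds). */
final class HostIntervalSketch {

  static final class Bounds {
    final float min;
    final float max;

    Bounds(float min, float max) {
      this.min = min;
      this.max = max;
    }
  }

  private final Bounds defaults;
  private final Map<String, Bounds> perHost = new HashMap<>();

  HostIntervalSketch(float defaultMin, float defaultMax) {
    this.defaults = new Bounds(defaultMin, defaultMax);
  }

  void setHostBounds(String host, float min, float max) {
    perHost.put(host, new Bounds(min, max));
  }

  /** Clamps a proposed interval to the host's bounds, or to the defaults. */
  float clamp(String host, float proposedInterval) {
    Bounds bounds = perHost.getOrDefault(host, defaults);
    return Math.max(bounds.min, Math.min(bounds.max, proposedInterval));
  }

  public static void main(String[] args) {
    HostIntervalSketch schedule = new HostIntervalSketch(60f, 2592000f);
    schedule.setHostBounds("news.example.com", 300f, 3600f);
    // A one-day proposal is capped for the configured host only.
    System.out.println(schedule.clamp("news.example.com", 86400f));
    System.out.println(schedule.clamp("other.example.com", 86400f));
  }
}
```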





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826823#comment-17826823
 ] 

Markus Jelsma commented on NUTCH-3029:
--

throws was missing too

   84cda2abd..a8ec17ca8  master -> master

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826783#comment-17826783
 ] 

Markus Jelsma commented on NUTCH-3029:
--

Thanks Lewis!

   5ba50c0c6..84cda2abd  master -> master



 

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826759#comment-17826759
 ] 

Markus Jelsma commented on NUTCH-3029:
--

   4f62dec0f..5ba50c0c6  master -> master



actual change was missing from the commit for some reason

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Commented] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826760#comment-17826760
 ] 

Markus Jelsma commented on NUTCH-3033:
--

Ah, the new ivy works like a charm!

Thanks!

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-3029.
--
Resolution: Fixed

Thanks Martin!

   551c50b1c..4642c30c2  master -> master

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Resolved] (NUTCH-3030) Use system default cipher suites instead of hard-coded set

2024-03-13 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-3030.
--
Resolution: Fixed

42b55f6a9..551c50b1c  master -> master

 

Thanks Martin!

 

> Use system default cipher suites instead of hard-coded set
> --
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).
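
The fallback described above could look roughly like this, using only the 
standard javax.net.ssl API. It is a sketch under assumptions rather than the 
patch itself: CipherSuiteSketch and resolveCipherSuites are hypothetical names, 
and the configured value is assumed to be a comma-separated string.

```java
import java.util.Arrays;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical sketch: configured cipher suites, else the JVM defaults. */
final class CipherSuiteSketch {

  /** Returns the configured suites, or the system defaults if none are set. */
  static String[] resolveCipherSuites(String configured) {
    if (configured == null || configured.trim().isEmpty()) {
      SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
      // getSupportedCipherSuites() would also include suites that are available
      // but not enabled by default.
      return factory.getDefaultCipherSuites();
    }
    return Arrays.stream(configured.split(","))
        .map(String::trim)
        .filter(name -> !name.isEmpty())
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    System.out.println("default suites: " + resolveCipherSuites(null).length);
    System.out.println(Arrays.toString(
        resolveCipherSuites("TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384")));
  }
}
```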





[jira] [Updated] (NUTCH-3030) Use system default cipher suites instead of hard-coded set

2024-03-13 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3030:
-
Summary: Use system default cipher suites instead of hard-coded set  (was: 
Update default TLS cipher suites for http(s) protocol)

> Use system default cipher suites instead of hard-coded set
> --
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).





[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-12 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825863#comment-17825863
 ] 

Markus Jelsma commented on NUTCH-3032:
--

No idea what git fork is supposed to do, maybe it should be a git branch 
instead. I am not a skilled Git user, but you can always attach a patch to 
this ticket.

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Resolved] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-12 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-3031.
--
Resolution: Fixed

   83acd501e..c390dfc8b  master -> master

> ProtocolFactory host mapper to support domains
> --
>
> Key: NUTCH-3031
> URL: https://issues.apache.org/jira/browse/NUTCH-3031
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-3031.patch
>
>
> Currently ProtocolFactory supports different protocol plugins based on the 
> host configured for it. This patch will add support for listing domains as 
> well so you don't have to list numerous subdomains for one larger domain.
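
A lookup that tries the exact host first and then falls back to a configured 
domain could be sketched like this. It is an illustration only, not the code in 
NUTCH-3031.patch: ProtocolMapperSketch and its maps are hypothetical, and 
parent domains are found by stripping leading labels instead of consulting a 
public-suffix list.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: map a host to a protocol plugin by host, then domain. */
final class ProtocolMapperSketch {

  private final Map<String, String> byHost = new HashMap<>();
  private final Map<String, String> byDomain = new HashMap<>();

  void mapHost(String host, String pluginId) {
    byHost.put(host.toLowerCase(Locale.ROOT), pluginId);
  }

  void mapDomain(String domain, String pluginId) {
    byDomain.put(domain.toLowerCase(Locale.ROOT), pluginId);
  }

  /** Exact host match wins; otherwise the closest configured parent domain. */
  String pluginFor(String host, String defaultPlugin) {
    String candidate = host.toLowerCase(Locale.ROOT);
    String plugin = byHost.get(candidate);
    if (plugin != null) {
      return plugin;
    }
    while (true) {
      plugin = byDomain.get(candidate);
      if (plugin != null) {
        return plugin;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return defaultPlugin;
      }
      candidate = candidate.substring(dot + 1); // drop the leftmost label
    }
  }

  public static void main(String[] args) {
    ProtocolMapperSketch mapper = new ProtocolMapperSketch();
    mapper.mapDomain("example.org", "protocol-okhttp");
    mapper.mapHost("api.example.org", "protocol-http");
    System.out.println(mapper.pluginFor("www.example.org", "default"));
    System.out.println(mapper.pluginFor("api.example.org", "default"));
  }
}
```

One domain entry then covers every subdomain, while a host entry can still 
override it for a single machine.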





Re: [DISCUSS] Release Nutch 1.20

2024-03-10 Thread Markus Jelsma
Good idea! I'll finish work on three open issues next week.

On Sat, 9 Mar 2024 at 13:02, Sebastian Nagel <
wastl.na...@googlemail.com> wrote:

> Hi Lewis,
>
> yes, of course!
>
> Some points we should do before the release:
>
> - address the ES licensing issue,
>the easiest way is to downgrade, see NUTCH-3008
>If done update the license-related files.
>
> - there are three short PRs open
>
> I'll try to have a look at these points the next days.
>
> Best,
> Sebastian
>
>
> On 3/8/24 01:43, lewis john mcgibbney wrote:
> > Hi dev@,
> > As of today, 51 issues have been addressed in the 1.20 development drive.
> > https://issues.apache.org/jira/projects/NUTCH/versions/12352190
> > 
> > I would like to push a release soon and ship it to the user community.
> > Any objections?
> > Thank you
> > lewismc
> >
>


[jira] [Updated] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-05 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3031:
-
Attachment: NUTCH-3031.patch

> ProtocolFactory host mapper to support domains
> --
>
> Key: NUTCH-3031
> URL: https://issues.apache.org/jira/browse/NUTCH-3031
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-3031.patch
>
>
> Currently ProtocolFactory supports different protocol plugins based on the 
> host configured for it. This patch will add support for listing domains as 
> well so you don't have to list numerous subdomains for one larger domain.





[jira] [Created] (NUTCH-3031) ProtocolFactory host mapper to support domains

2024-03-05 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3031:


 Summary: ProtocolFactory host mapper to support domains
 Key: NUTCH-3031
 URL: https://issues.apache.org/jira/browse/NUTCH-3031
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.20


Currently ProtocolFactory supports different protocol plugins based on the host 
configured for it. This patch will add support for listing domains as well so 
you don't have to list numerous subdomains for one larger domain.





[jira] [Commented] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818531#comment-17818531
 ] 

Markus Jelsma commented on NUTCH-3030:
--

For some reason the attached patch did not apply cleanly (error on line 96), 
added new patch that does apply without complaining.

> Update default TLS cipher suites for http(s) protocol
> -
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).





[jira] [Updated] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3030:
-
Attachment: NUTCH-3030.patch

> Update default TLS cipher suites for http(s) protocol
> -
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3030.patch, default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).





[jira] [Assigned] (NUTCH-3030) Update default TLS cipher suites for http(s) protocol

2024-02-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-3030:


Assignee: Markus Jelsma

> Update default TLS cipher suites for http(s) protocol
> -
>
> Key: NUTCH-3030
> URL: https://issues.apache.org/jira/browse/NUTCH-3030
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.19
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: default_ciphers_and_protocols-2.patch
>
>
> If http.tls.supported.cipher.suites is not set in the configuration, it 
> defaults to a hard-coded list which is not exhaustive enough. I have 
> encountered websites that exclusively use ciphers which are not included, so 
> they could not be handled by protocol-http.
> I changed this list to the system default -- SSLSocketFactory's 
> .getDefaultCipherSuites() to be precise. One could also use 
> .getSupportedCipherSuites() here, I suppose.
> The original list should be moved to nutch-default.xml or omitted altogether. 
> The protocol list is still hard-coded, but it is now also added to 
> nutch-default.xml (so it can be easily changed manually if needed).





[jira] [Assigned] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-02-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-3029:


Assignee: Markus Jelsma

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).
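
The idea of host-specific bounds can be sketched as follows. This is an illustration only: the class, its methods and the way bounds are registered are hypothetical and do not reflect the attached patch or the format of its .txt file.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class HostIntervalBounds {
    // Global bounds in seconds, standing in for the scheduler-wide min/max settings.
    private final float defaultMin;
    private final float defaultMax;
    // host -> {min, max} in seconds; filled from whatever per-host configuration is used.
    private final Map<String, float[]> perHost = new HashMap<>();

    HostIntervalBounds(float defaultMin, float defaultMax) {
        this.defaultMin = defaultMin;
        this.defaultMax = defaultMax;
    }

    // Registers custom bounds for one host (host names are compared case-insensitively).
    void put(String host, float min, float max) {
        perHost.put(host.toLowerCase(Locale.ROOT), new float[] {min, max});
    }

    // Clamps a computed refetch interval to the host's bounds, or to the
    // global bounds when the host has no custom entry.
    float clamp(String host, float interval) {
        float[] bounds = perHost.get(host.toLowerCase(Locale.ROOT));
        float min = bounds != null ? bounds[0] : defaultMin;
        float max = bounds != null ? bounds[1] : defaultMax;
        return Math.max(min, Math.min(max, interval));
    }

    public static void main(String[] args) {
        HostIntervalBounds bounds = new HostIntervalBounds(60f, 2592000f);
        bounds.put("example.org", 3600f, 86400f);
        System.out.println(bounds.clamp("example.org", 10f));
        System.out.println(bounds.clamp("other.example", 10f));
    }
}
```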





Tika parsing error 1.19

2024-02-15 Thread Markus Jelsma
Hi,

We were doing some tests with 1.19 and found that some sites became
unparsable using Tika. At the moment I know of at least two sites causing
this: my own, https://www.openindex.io/, and https://www.elzendaalcollege.nl/

2024-02-15 12:33:49,639 WARN o.a.n.p.ParseUtil [main] Error parsing
https://www.elzendaalcollege.nl/ with
org.apache.nutch.parse.tika.TikaParser
java.util.concurrent.ExecutionException: java.lang.NoSuchFieldError:
NUM_IMAGES
   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
~[?:?]
   at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:?]
   at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
~[apache-nutch-1.20-SNAPSHOT.jar:?]
   at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
~[apache-nutch-1.20-SNAPSHOT.jar:?]
   at
org.apache.nutch.parse.ParserChecker.process(ParserChecker.java:266)
~[apache-nutch-1.20-SNAPSHOT.jar:?]
   at
org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:86)
~[apache-nutch-1.20-SNAPSHOT.jar:?]
   at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:150)
~[apache-nutch-1.20-SNAPSHOT.jar:?]
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
~[hadoop-common-3.3.4.jar:?]
   at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:308)
~[apache-nutch-1.20-SNAPSHOT.jar:?]
Caused by: java.lang.NoSuchFieldError: NUM_IMAGES
   at
org.apache.tika.parser.image.ImageParser.extractMetadata(ImageParser.java:177)
~[?:?]
   at
org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:79)
~[?:?]
   at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
~[tika-core-2.3.0.jar:2.3.0]
   at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
~[tika-core-2.3.0.jar:2.3.0]
   at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
~[tika-core-2.3.0.jar:2.3.0]
   at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:185)
~[tika-core-2.3.0.jar:2.3.0]
   at
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
~[tika-core-2.3.0.jar:2.3.0]
   at
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:106)
~[tika-core-2.3.0.jar:2.3.0]
   at
org.apache.tika.parser.html.HtmlHandler.handleDataURIScheme(HtmlHandler.java:385)
~[?:?]
   at
org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:187)
~[?:?]
   at
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:123)
~[tika-core-2.3.0.jar:2.3.0]
   at
org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:59)
~[?:?]
   at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794) ~[?:?]
   at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061) ~[?:?]
   at org.ccil.cowan.tagsoup.Parser.stage(Parser.java:1026) ~[?:?]
   at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:633)
~[?:?]
   at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449) ~[?:?]
   at
org.apache.tika.parser.html.HtmlParser.parseImpl(HtmlParser.java:149)
~[?:?]
   at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:99)
~[?:?]
   at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
~[tika-core-2.3.0.jar:2.3.0]
   at
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
   at
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
~[apache-nutch-1.20-SNAPSHOT.jar:?]
   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
~[apache-nutch-1.20-SNAPSHOT.jar:?]
   at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
   at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
~[?:?]
   at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
~[?:?]
   at java.lang.Thread.run(Thread.java:829) ~[?:?]


[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815345#comment-17815345
 ] 

Markus Jelsma commented on NUTCH-3028:
--

New patch: previously, an exception was raised when no expression was set.

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3028-1.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-06 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814731#comment-17814731
 ] 

Markus Jelsma commented on NUTCH-3028:
--

Any objections to this one before I commit it?

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Description: 
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
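
What the example expression selects can be shown in plain Java. The sketch below deliberately does not use the JEXL API or any Nutch class: a Map<String, String> stands in for the parse metadata, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class ParseMetaFilter {

    // Plain-Java counterpart of the example expression. Calling equals() on
    // the expected value keeps records without the key from throwing a
    // NullPointerException; they simply do not match.
    static Predicate<Map<String, String>> keyEquals(String key, String value) {
        return meta -> value.equals(meta.get(key));
    }

    // Keeps the records whose metadata satisfies the filter. A null filter
    // (no expression given) keeps every record.
    static List<Map<String, String>> select(List<Map<String, String>> records,
                                            Predicate<Map<String, String>> filter) {
        if (filter == null) {
            return records;
        }
        return records.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
            Map.of("SOME_KEY", "SOME_VALUE"),
            Map.of("SOME_KEY", "other"),
            Map.<String, String>of());
        // Only the first record is selected.
        System.out.println(select(records, keyEquals("SOME_KEY", "SOME_VALUE")));
    }
}
```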

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3027.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>






[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: (was: NUTCH-3027.patch)

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>






[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3028.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-3028.patch
>
>






[jira] [Created] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-02-01 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-3028:


 Summary: WARCExported to support filtering by JEXL
 Key: NUTCH-3028
 URL: https://issues.apache.org/jira/browse/NUTCH-3028
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma








[jira] [Work started] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3027 started by Markus Jelsma.

> Trivial resource leak patch in DomainSuffixes.java
> --
>
> Key: NUTCH-3027
> URL: https://issues.apache.org/jira/browse/NUTCH-3027
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.20
>Reporter: Sascha Kehrli
>    Assignee: Markus Jelsma
>Priority: Trivial
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Found a trivial resource leak in .../util/DomainSuffixes.java, where an 
> InputStream is not closed:
> {code:java}
> InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file);
> try {
>     new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>     LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
>
> instead of:
> {code:java}
> try (InputStream input =
>         this.getClass().getClassLoader().getResourceAsStream(file)) {
>     new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>     LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
> Where the InputStream is automatically closed.
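
The behavioural difference between the two variants can be reproduced in isolation. This is a minimal sketch with hypothetical names (CloseTracking, TrackedStream, readAndFail); it does not touch DomainSuffixes itself and only demonstrates how try-with-resources closes the stream even when reading fails.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class CloseTracking {

    // In-memory stream that records whether close() was called.
    static class TrackedStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackedStream() {
            super(new byte[] {1, 2, 3});
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Reads one byte and then fails, as a reader might on malformed input.
    static void readAndFail(InputStream in) throws IOException {
        in.read();
        throw new IOException("simulated read failure");
    }

    // Variant without try-with-resources: nothing ever calls close().
    static boolean closedWithoutTryWithResources() {
        TrackedStream input = new TrackedStream();
        try {
            readAndFail(input);
        } catch (Exception ex) {
            // swallowed here; the original snippet logs a warning instead
        }
        return input.closed;
    }

    // Variant with try-with-resources: close() runs before the catch block.
    static boolean closedWithTryWithResources() {
        TrackedStream tracked = new TrackedStream();
        try (InputStream input = tracked) {
            readAndFail(input);
        } catch (Exception ex) {
            // swallowed here; the original snippet logs a warning instead
        }
        return tracked.closed;
    }

    public static void main(String[] args) {
        System.out.println(closedWithoutTryWithResources()); // prints "false"
        System.out.println(closedWithTryWithResources());    // prints "true"
    }
}
```

A resource that is initialized to null is skipped when the try block exits, so a missing classpath resource does not cause an additional failure during close.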





[jira] [Commented] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808614#comment-17808614
 ] 

Markus Jelsma commented on NUTCH-3027:
--

Thanks Sascha Kehrli!

Committed 85fea6e46..6b0455454  master -> master

> Trivial resource leak patch in DomainSuffixes.java
> --
>
> Key: NUTCH-3027
> URL: https://issues.apache.org/jira/browse/NUTCH-3027
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.20
>Reporter: Sascha Kehrli
>Assignee: Markus Jelsma
>Priority: Trivial
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Found a trivial resource leak in .../util/DomainSuffixes.java, where an 
> InputStream is not closed:
> {code:java}
> InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file);
> try {
>     new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>     LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
>
> instead of:
> {code:java}
> try (InputStream input =
>         this.getClass().getClassLoader().getResourceAsStream(file)) {
>     new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>     LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
> Where the InputStream is automatically closed.





[jira] [Resolved] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-3027.
--
Fix Version/s: 1.20
   Resolution: Fixed

> Trivial resource leak patch in DomainSuffixes.java
> --
>
> Key: NUTCH-3027
> URL: https://issues.apache.org/jira/browse/NUTCH-3027
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.20
>Reporter: Sascha Kehrli
>    Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.20
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Found a trivial resource leak in .../util/DomainSuffixes.java, where an 
> InputStream is not closed:
> {code:java}
> InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file);
> try {
>     new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>     LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
>
> instead of:
> {code:java}
> try (InputStream input =
>         this.getClass().getClassLoader().getResourceAsStream(file)) {
>     new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>     LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
> Where the InputStream is automatically closed.





[jira] [Assigned] (NUTCH-3027) Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-3027:


Assignee: Markus Jelsma

> Trivial resource leak patch in DomainSuffixes.java
> --
>
> Key: NUTCH-3027
> URL: https://issues.apache.org/jira/browse/NUTCH-3027
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.20
>Reporter: Sascha Kehrli
>    Assignee: Markus Jelsma
>Priority: Trivial
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Found a trivial resource leak in .../util/DomainSuffixes.java, where an 
> InputStream is not closed:
> {code:java}
> InputStream input =
>     this.getClass().getClassLoader().getResourceAsStream(file);
> try {
>     new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>     LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
>
> instead of:
> {code:java}
> try (InputStream input =
>         this.getClass().getClassLoader().getResourceAsStream(file)) {
>     new DomainSuffixesReader().read(this, input);
> } catch (Exception ex) {
>     LOG.warn(StringUtils.stringifyException(ex));
> }
> {code}
> Where the InputStream is automatically closed.





[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-02 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771023#comment-17771023
 ] 

Markus Jelsma commented on NUTCH-1635:
--

Good point! No, we haven't seen this behaviour for the past decade or so. Let's 
close it!

Danke!

> New crawldb sometimes ends up in current
> 
>
> Key: NUTCH-1635
> URL: https://issues.apache.org/jira/browse/NUTCH-1635
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>    Reporter: Markus Jelsma
>Priority: Major
>
> In some weird cases the crawldb newly created by updatedb ends up in 
> crawl/crawldb/current//. So instead of replacing current/, it ends up 
> inside current/! This causes the generator to fail.
> It's impossible to reliably reproduce the problem. It has only happened a 
> couple of times in the last few years.





[jira] [Closed] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-02 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-1635.

Resolution: Not A Problem

> New crawldb sometimes ends up in current
> 
>
> Key: NUTCH-1635
> URL: https://issues.apache.org/jira/browse/NUTCH-1635
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>    Reporter: Markus Jelsma
>Priority: Major
>
> In some weird cases the crawldb newly created by updatedb ends up in 
> crawl/crawldb/current//. So instead of replacing current/, it ends up 
> inside current/! This causes the generator to fail.
> It's impossible to reliably reproduce the problem. It has only happened a 
> couple of times in the last few years.





[jira] [Commented] (NUTCH-3007) Fix impossible casts

2023-09-28 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769989#comment-17769989
 ] 

Markus Jelsma commented on NUTCH-3007:
--

+1 yes!

> Fix impossible casts
> 
>
> Key: NUTCH-3007
> URL: https://issues.apache.org/jira/browse/NUTCH-3007
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Spotbugs reports two occurrences of
>   Impossible cast from java.util.ArrayList to String[] in 
> org.apache.nutch.fetcher.Fetcher.run(Map, String)
> Both were introduced later into the {{run(Map args, String 
> crawlId)}} method and obviously never used (would throw a 
> ClassCastException). The code blocks should be removed.
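
Why such a cast can never succeed is easy to reproduce. The sketch below uses hypothetical names and is not the Fetcher code; the issue proposes removing the unused blocks, so the conversion shown here is only an illustration of what a working alternative would look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ImpossibleCast {

    // Mirrors the flagged pattern: a value handed over as Object is cast to
    // String[] although it actually holds an ArrayList.
    static boolean castFails(Object value) {
        try {
            String[] unused = (String[]) value;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // A list has to be copied into an array; it cannot be cast to one.
    static String[] toStringArray(List<String> values) {
        return values.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> segments = new ArrayList<>(Arrays.asList("seg1", "seg2"));
        System.out.println(castFails(segments));                      // prints "true"
        System.out.println(Arrays.toString(toStringArray(segments))); // prints "[seg1, seg2]"
    }
}
```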





[jira] [Commented] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-28 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769988#comment-17769988
 ] 

Markus Jelsma commented on NUTCH-2852:
--

Seems just fine for these files +1

> Method invokes System.exit(...) 9 bugs
> --
>
> Key: NUTCH-2852
> URL: https://issues.apache.org/jira/browse/NUTCH-2852
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> org.apache.nutch.indexer.IndexingFiltersChecker since first historized release
> In class org.apache.nutch.indexer.IndexingFiltersChecker
> In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[])
> At IndexingFiltersChecker.java:[line 96]
> Another occurrence at IndexingFiltersChecker.java:[line 129]
> org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes 
> System.exit(...), which shuts down the entire virtual machine
> Invoking System.exit shuts down the entire Java virtual machine. This should 
> only be done when it is appropriate. Such calls make it hard or impossible 
> for your code to be invoked by other code. Consider throwing a 
> RuntimeException instead.
> Also occurs in
>org.apache.nutch.net.URLFilterChecker since first historized release
>org.apache.nutch.net.URLNormalizerChecker since first historized release
>org.apache.nutch.parse.ParseSegment since first historized release
>org.apache.nutch.parse.ParserChecker since first historized release
>org.apache.nutch.service.NutchServer since first historized release
>org.apache.nutch.tools.CommonCrawlDataDumper since first historized release
>org.apache.nutch.tools.DmozParser since first historized release
>org.apache.nutch.util.AbstractChecker since first historized release 
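
The pattern Spotbugs asks for can be sketched as follows. CheckerTool and its arguments are hypothetical and not one of the listed classes; the point is only that run() reports problems through its return value or an exception, so other code can call it safely.

```java
class CheckerTool {

    // Returns a status code instead of terminating the JVM; invalid input is
    // reported with a RuntimeException, as the Spotbugs message suggests.
    static int run(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: CheckerTool <url>");
            return 1;
        }
        if (args[0].isEmpty()) {
            throw new IllegalArgumentException("URL must not be empty");
        }
        System.out.println("checking " + args[0]);
        return 0;
    }

    public static void main(String[] args) {
        // Both calls return normally, so run() can be invoked repeatedly from
        // the same JVM. A command-line entry point would instead end with
        // System.exit(run(args)), the one place where exiting is appropriate.
        System.out.println(run(new String[0]));
        System.out.println(run(new String[] {"https://example.org/"}));
    }
}
```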





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-18 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766306#comment-17766306
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Thanks for picking it up. I am very happy this one is resolved now. Thanks 
Sebastian for testing!

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I ran into trouble upgrading some dependencies today and got a lot of 
> LinkageErrors or, with a Tika upgrade, disappearing logs. This patch fixes 
> that by moving to slf4j2, using the correct log4j-slf4j2-impl binding and 
> getting rid of the old log4j -> reload4j.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2023-09-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764699#comment-17764699
 ] 

Markus Jelsma commented on NUTCH-2978:
--

You managed to get it up and running, also when deployed on Hadoop? This 
ticket almost drove me to tears and despair :D

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I ran into trouble upgrading some dependencies today and got a lot of 
> LinkageErrors or, with a Tika upgrade, disappearing logs. This patch fixes 
> that by moving to slf4j2, using the correct log4j-slf4j2-impl binding and 
> getting rid of the old log4j -> reload4j.





[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body, strips off the <head> element

2023-09-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764697#comment-17764697
 ] 

Markus Jelsma commented on NUTCH-3000:
--

Yes, this is a bit odd indeed. +1

> protocol-selenium returns only the body, strips off the <head> element
> --
>
> Key: NUTCH-3000
> URL: https://issues.apache.org/jira/browse/NUTCH-3000
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Reporter: Tim Allison
>Priority: Major
>
> The selenium protocol returns only the body portion of the html, which means 
> that neither the title nor the other page metadata in the <head> section 
> gets extracted.
> {noformat}
> String innerHtml = driver.findElement(By.tagName("body"))
> .getAttribute("innerHTML");
> {noformat}
> We should return the full html, no?





[jira] [Commented] (NUTCH-2999) Update Lucene version to latest 8.x

2023-08-30 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760522#comment-17760522
 ] 

Markus Jelsma commented on NUTCH-2999:
--

Seems fine +1

> Update Lucene version to latest 8.x
> ---
>
> Key: NUTCH-2999
> URL: https://issues.apache.org/jira/browse/NUTCH-2999
> Project: Nutch
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> It may be the way that I'm loading the project, but, for me, Intellij really 
> does not like the Lucene version conflict between {{scoring-similarity}} and 
> the OpenSearch/Elasticsearch modules.
> Can we bump Lucene to the latest 8.11.2 throughout?
> PR for review incoming.





[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738191#comment-17738191
 ] 

Markus Jelsma commented on NUTCH-2993:
--

To be honest, I am not too happy with the implementation as it stands. Ideally 
we would match every outlink against the regex, but that would be even more 
costly. The crawler still ends up in bad sections of a site and elsewhere on 
the web, but with low depth settings it is manageable.

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth.





[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738047#comment-17738047
 ] 

Markus Jelsma commented on NUTCH-2993:
--

Thanks Sebastian!
 # changed the checks again.
 # check for empty/non-configured pattern in place
 # props added to config
 # try/catch in place
 # typo removed

I ran a few long crawls with just over a hundred domains and changed the checks 
again: maxDepth is now reset to the default if the URL does NOT match the 
pattern. Previously, sitemap-like pages could still be passed an overridden 
maxDepth because a linking page matched the pattern, and then a whole site got 
crawled anyway.
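
The reset described above can be sketched as follows. This is an illustration under assumptions, not the patch: the class, its fields and the method name are hypothetical, and the actual plugin works on CrawlDatum metadata rather than plain integers.

```java
import java.util.regex.Pattern;

class DepthOverride {
    private final Pattern pattern;       // null when no pattern is configured
    private final int defaultMaxDepth;   // depth limit for ordinary pages
    private final int overrideMaxDepth;  // depth limit for pages matching the pattern

    DepthOverride(String regex, int defaultMaxDepth, int overrideMaxDepth) {
        this.pattern = (regex == null || regex.isEmpty()) ? null : Pattern.compile(regex);
        this.defaultMaxDepth = defaultMaxDepth;
        this.overrideMaxDepth = overrideMaxDepth;
    }

    // Max depth to pass on to the outlinks of the given page. A page that
    // does not match always gets the default, so an override inherited from
    // a matching parent is dropped again.
    int maxDepthForOutlinksOf(String pageUrl) {
        if (pattern != null && pattern.matcher(pageUrl).find()) {
            return overrideMaxDepth;
        }
        return defaultMaxDepth;
    }

    public static void main(String[] args) {
        DepthOverride depth = new DepthOverride("/news/", 2, 10);
        System.out.println(depth.maxDepthForOutlinksOf("https://example.org/news/item")); // prints "10"
        System.out.println(depth.maxDepthForOutlinksOf("https://example.org/sitemap"));   // prints "2"
    }
}
```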

 

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15-1.patch

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993.patch)

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993.patch

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-28 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Description: 
We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites.


This patch overrides maxDepth for outlinks of URLs matching a configured 
pattern. URLs not matching the pattern get the configured default max depth 
value.

  was:
We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites. This patch skips the depth check if the current URL 
matches some regular expression.

 

Initially we tried to set a custom maxDepth based on a Pattern match, but this 
didn't work. The crawler still managed to creep too deep due to having links 
everywhere.


> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15.patch

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993-1.15-1.patch)

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993-1.15.patch)

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15-1.patch

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Description: 
We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites. This patch skips the depth check if the current URL 
matches some regular expression.

 

Initially we tried to set a custom maxDepth based on a Pattern match, but this 
didn't work. The crawler still managed to creep too deep due to having links 
everywhere.

  was:We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites. This patch allows for an overridden max depth if the 
current URL matches a Pattern. If find() succeeds, all outlinks are given a new 
max depth.


> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch skips the depth check if the current URL 
> matches some regular expression.
>  
> Initially we tried to set a custom maxDepth based on a Pattern match, but 
> this didn't work. The crawler still managed to creep too deep due to having 
> links everywhere.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Summary: ScoringDepth plugin to skip depth check based on URL Pattern  
(was: ScoringDepth plugin to override maxDepth based on URL Pattern)

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the 
> current URL matches a Pattern. If find() succeeds, all outlinks are given a 
> new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15.patch

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the 
> current URL matches a Pattern. If find() succeeds, all outlinks are given a 
> new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993-1.15.patch)

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the 
> current URL matches a Pattern. If find() succeeds, all outlinks are given a 
> new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: (was: NUTCH-2993-1.15.patch)

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the 
> current URL matches a Pattern. If find() succeeds, all outlinks are given a 
> new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15.patch

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the 
> current URL matches a Pattern. If find() succeeds, all outlinks are given a 
> new max depth.





[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730535#comment-17730535
 ] 

Markus Jelsma commented on NUTCH-2993:
--

Here's a simple patch against Nutch 1.15. Will patch for master later.

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the 
> current URL matches a Pattern. If find() succeeds, all outlinks are given a 
> new max depth.





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2993:
-
Attachment: NUTCH-2993-1.15.patch

> ScoringDepth plugin to override maxDepth based on URL Pattern
> -
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawl to go deep and broad, but instead focus it on a 
> narrow section of sites. This patch allows for an overridden max depth if the 
> current URL matches a Pattern. If find() succeeds, all outlinks are given a 
> new max depth.





[jira] [Created] (NUTCH-2993) ScoringDepth plugin to override maxDepth based on URL Pattern

2023-06-08 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2993:


 Summary: ScoringDepth plugin to override maxDepth based on URL 
Pattern
 Key: NUTCH-2993
 URL: https://issues.apache.org/jira/browse/NUTCH-2993
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.20


We do not want some crawl to go deep and broad, but instead focus it on a 
narrow section of sites. This patch allows for an overridden max depth if the 
current URL matches a Pattern. If find() succeeds, all outlinks are given a new 
max depth.
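
The idea described above can be sketched roughly as follows. This is only an 
illustration, not the code from the attached patch: the class name, the 
constructor parameters and the method name are hypothetical, and the real 
plugin would read its pattern and depth values from the Nutch configuration.

```java
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the pattern-based max depth override. */
final class DepthOverrideSketch {
  private final Pattern pattern;      // assumed to come from configuration
  private final int defaultMaxDepth;  // the regular max depth
  private final int overrideMaxDepth; // max depth given to outlinks on a match

  DepthOverrideSketch(String regex, int defaultMaxDepth, int overrideMaxDepth) {
    this.pattern = Pattern.compile(regex);
    this.defaultMaxDepth = defaultMaxDepth;
    this.overrideMaxDepth = overrideMaxDepth;
  }

  /** Returns the max depth to assign to all outlinks of the page at fromUrl. */
  int maxDepthForOutlinks(String fromUrl) {
    // find() succeeds if the pattern occurs anywhere in the URL, whereas
    // matches() would require the whole URL to match the pattern.
    return pattern.matcher(fromUrl).find() ? overrideMaxDepth : defaultMaxDepth;
  }
}
```

Because find() is used, a pattern such as /docs/ is enough to select every URL 
containing that path segment; no leading or trailing .* is needed.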





[jira] [Commented] (NUTCH-2985) Disable plugin urlfilter-validator by default

2023-02-24 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693291#comment-17693291
 ] 

Markus Jelsma commented on NUTCH-2985:
--

+1

> Disable plugin urlfilter-validator by default
> -
>
> Key: NUTCH-2985
> URL: https://issues.apache.org/jira/browse/NUTCH-2985
> Project: Nutch
>  Issue Type: Bug
>  Components: configuration, urlfilter
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> The plugin urlfilter-validator is activated by default (in nutch-default.xml) 
> but has two major issues which may confuse users of Nutch:
> - single-part domain names (localhost, etc.) are not allowed (NUTCH-2973)
> - IPv6 host names are rejected as invalid (NUTCH-2705)
> What about disabling it by default to overcome these issues?





Re: Upgrading Selenium

2023-01-20 Thread Markus Jelsma
> There must be a way, some how, some time.

There isn't:
https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/141

Op do 19 jan. 2023 om 15:23 schreef Markus Jelsma <
markus.jel...@openindex.io>:

> > This makes some sense if you do not know anything about the URL.
> > - a HEAD request could do almost the same
> > - often one knows whether there are only HTML pages or also PDFs, zip files,
> >   and other stuff not suitable for Selenium. Could make the HEAD request
> >   optional.
>
> Ah crap, i forgot about that. With Selenium, it is still not possible to
> get the HTTP headers of the most recent request. And when requesting the
> page source, it will either return nothing, or the previous 'successful'
> call when requesting a non-text MIME-type URL.
>
> Besides doing a HEAD request first, there is no neat way to work with
> non-text/html URLs as we can using HtmlUnit. That at least returns the
> headers and the raw binary data.
>
> There must be a way, some how, some time.
>
> Thanks,
> Markus
>
> Op do 19 jan. 2023 om 11:38 schreef Sebastian Nagel <
> wastl.na...@googlemail.com>:
>
>> Hi Kamil, hi Markus,
>>
>> upgrading the Selenium plugin is very appreciated!
>>
>>  > Besides that, the plugin also needs some overhaul.
>>
>> Definitely.
>>
>>  > It currently first downloads the URL with HttpClient, and then, depending on
>>  > MIME-type, it may or may not forward the URL to Selenium so it can be
>>  > downloaded again.
>>
>> This makes some sense if you do not know anything about the URL.
>> - a HEAD request could do almost the same
>> - often one knows whether there are only HTML pages or also PDFs, zip files,
>>   and other stuff not suitable for Selenium. Could make the HEAD request
>>   optional.
>>
>>  > merging the lib-selenium plugin with the protocol-selenium plugin
>>
>> I guess lib-selenium is to share common components between protocol-selenium
>> and protocol-interactiveselenium. Maybe merge all three? Or skip
>> interactiveselenium for now.
>>
>> ~Sebastian
>>
>> On 1/17/23 19:56, Markus Jelsma wrote:
>> > Hello Kamil,
>> >
>> > Yes, the plugin needs some upgrading indeed. We use a modern version of it
>> > elsewhere and it works really well, at least better than HtmlUnit.
>> >
>> > Besides that, the plugin also needs some overhaul. It currently first downloads
>> > the URL with HttpClient, and then, depending on MIME-type, it may or may not
>> > forward the URL to Selenium so it can be downloaded again.
>> >
>> > There is a lot of code in the plugin that should be removed. I would also opt
>> > for merging the lib-selenium plugin with the protocol-selenium plugin. There is
>> > no obvious need for having it separated.
>> >
>> > These can be, of course, separate tasks.
>> >
>> > Regards,
>> > Markus
>> >
>> > Op di 17 jan. 2023 om 17:49 schreef Kamil Mroczek :
>> >
>> > Hello,
>> >
>> > I am sending a message to inquire whether I should submit a patch which
>> > updates selenium to the latest version. Although it is a major version
>> > upgrade to the library, very few code changes were needed to update.
>> >
>> > For a preview of the changes I made you can look here
>> > <https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
>> > Although not used in the code anymore (it was commented out), PhantomJS
>> > support has been removed from Selenium in the latest version. The commit
>> > also removes Opera since it was commented out but I can leave that in if
>> > needed. The build and tests pass. I have been using the Chrome driver
>> > successfully with it and would just need to run a quick test with Firefox
>> > to make sure it works too.
>> >
>> > I have only been using Nutch for about a month but have spent quite a bit of
>> > time looking over different parts of the code to understand how to configure
>> > it and change it.
>> >
>> > Kamil
>> >
>>
>


Re: Upgrading Selenium

2023-01-19 Thread Markus Jelsma
> This makes some sense if you do not know anything about the URL.
> - a HEAD request could do almost the same
> - often one knows whether there are only HTML pages or also PDFs, zip files,
>   and other stuff not suitable for Selenium. Could make the HEAD request
>   optional.

Ah crap, I forgot about that. With Selenium, it is still not possible to
get the HTTP headers of the most recent request. And when requesting the
page source, it will either return nothing, or the previous 'successful'
call when requesting a non-text MIME-type URL.

Besides doing a HEAD request first, there is no neat way to work with
non-text/html URLs as we can using HtmlUnit. That at least returns the
headers and the raw binary data.
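
A rough sketch of what such a HEAD check could look like, purely as an
illustration. It uses the JDK's java.net.http client (Java 11+) as a stand-in
rather than any Nutch protocol plugin, and the class and method names
(HeadProbe, isHtml, shouldUseSelenium) are made up for this example:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: decide from a HEAD response whether Selenium is worth using. */
final class HeadProbe {

  /** True if the Content-Type header value denotes an HTML or XHTML document. */
  static boolean isHtml(String contentType) {
    if (contentType == null) {
      return false;
    }
    // Drop parameters such as "; charset=UTF-8" before comparing.
    String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    return mime.equals("text/html") || mime.equals("application/xhtml+xml");
  }

  /** Sends a HEAD request and reports whether the URL looks like an HTML page. */
  static boolean shouldUseSelenium(HttpClient client, URI uri)
      throws IOException, InterruptedException {
    HttpRequest head = HttpRequest.newBuilder(uri)
        .method("HEAD", HttpRequest.BodyPublishers.noBody())
        .build();
    HttpResponse<Void> response =
        client.send(head, HttpResponse.BodyHandlers.discarding());
    Optional<String> type = response.headers().firstValue("Content-Type");
    // No Content-Type at all is treated as "not HTML" in this sketch.
    return type.map(HeadProbe::isHtml).orElse(false);
  }
}
```

Anything that is not HTML (PDFs, zip files and so on) would then be fetched
with a regular HTTP client instead, at the cost of one extra round trip per URL.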

There must be a way, some how, some time.

Thanks,
Markus

Op do 19 jan. 2023 om 11:38 schreef Sebastian Nagel <
wastl.na...@googlemail.com>:

> Hi Kamil, hi Markus,
>
> upgrading the Selenium plugin is very appreciated!
>
>  > Besides that, the plugin also needs some overhaul.
>
> Definitely.
>
>  > It currently first downloads the URL with HttpClient, and then, depending on
>  > MIME-type, it may or may not forward the URL to Selenium so it can be
>  > downloaded again.
>
> This makes some sense if you do not know anything about the URL.
> - a HEAD request could do almost the same
> - often one knows whether there are only HTML pages or also PDFs, zip files,
>   and other stuff not suitable for Selenium. Could make the HEAD request
>   optional.
>
>  > merging the lib-selenium plugin with the protocol-selenium plugin
>
> I guess lib-selenium is to share common components between protocol-selenium
> and protocol-interactiveselenium. Maybe merge all three? Or skip
> interactiveselenium for now.
>
> ~Sebastian
>
> On 1/17/23 19:56, Markus Jelsma wrote:
> > Hello Kamil,
> >
> > Yes, the plugin needs some upgrading indeed. We use a modern version of it
> > elsewhere and it works really well, at least better than HtmlUnit.
> >
> > Besides that, the plugin also needs some overhaul. It currently first downloads
> > the URL with HttpClient, and then, depending on MIME-type, it may or may not
> > forward the URL to Selenium so it can be downloaded again.
> >
> > There is a lot of code in the plugin that should be removed. I would also opt
> > for merging the lib-selenium plugin with the protocol-selenium plugin. There is
> > no obvious need for having it separated.
> >
> > These can be, of course, separate tasks.
> >
> > Regards,
> > Markus
> >
> > Op di 17 jan. 2023 om 17:49 schreef Kamil Mroczek :
> >
> > Hello,
> >
> > I am sending a message to inquire whether I should submit a patch which
> > updates selenium to the latest version. Although it is a major version
> > upgrade to the library, very few code changes were needed to update.
> >
> > For a preview of the changes I made you can look here
> > <https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
> > Although not used in the code anymore (it was commented out), PhantomJS
> > support has been removed from Selenium in the latest version. The commit
> > also removes Opera since it was commented out but I can leave that in if
> > needed. The build and tests pass. I have been using the Chrome driver
> > successfully with it and would just need to run a quick test with Firefox
> > to make sure it works too.
> >
> > I have only been using Nutch for about a month but have spent quite a bit of
> > time looking over different parts of the code to understand how to configure
> > it and change it.
> >
> > Kamil
> >
>


Re: Upgrading Selenium

2023-01-17 Thread Markus Jelsma
Hello Kamil,

Yes, the plugin needs some upgrading indeed. We use a modern version of it
elsewhere and it works really well, at least better than HtmlUnit.

Besides that, the plugin also needs some overhaul. It currently first
downloads the URL with HttpClient, and then, depending on MIME-type, it may
or may not forward the URL to Selenium so it can be downloaded again.

There is a lot of code in the plugin that should be removed. I would also
opt for merging the lib-selenium plugin with the protocol-selenium plugin.
There is no obvious need for having it separated.

These can be, of course, separate tasks.

Regards,
Markus

Op di 17 jan. 2023 om 17:49 schreef Kamil Mroczek :

> Hello,
>
> I am sending a message to inquire whether I should submit a patch which
> updates selenium to the latest version. Although it is a major version
> upgrade to the library, very few code changes were needed to update.
>
> For a preview of the changes I made you can look here
> <https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
> Although not used in the code anymore (it was commented out), PhantomJS
> support has been removed from Selenium in the latest version. The commit
> also removes Opera since it was commented out but I can leave that in if
> needed. The build and tests pass. I have been using the Chrome driver
> successfully with it and would just need to run a quick test with Firefox
> to make sure it works too.
>
> I have only been using Nutch for about a month but have spent quite a bit
> of time looking over different parts of the code to understand how to
> configure it and change it.
>
> Kamil
>


[jira] [Commented] (NUTCH-2974) Ant build fails with "Unparseable date" on certain platforms

2023-01-16 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677435#comment-17677435
 ] 

Markus Jelsma commented on NUTCH-2974:
--

Sounds like a nice solution for this obscure bug +1

> Ant build fails with "Unparseable date" on certain platforms
> 
>
> Key: NUTCH-2974
> URL: https://issues.apache.org/jira/browse/NUTCH-2974
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> When touching the configuration templates the ant build fails on certain 
> platforms, see NUTCH-2512 and recently by [Kamil Mroczek on the users 
> list|https://lists.apache.org/thread/dc36ofc6kvvx3fxlqbnzqdcp73yjcj8m], 
> including a fix.
> However, we should also consider removing the "touch" action if it's not 
> clear what the purpose of it is - it's there since the initial import of the 
> Nutch source code to the Apache repository. Could be obsolete now.





[jira] [Commented] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.

2023-01-06 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655383#comment-17655383
 ] 

Markus Jelsma commented on NUTCH-2634:
--

+1

> Some links marked as "nofollow" are followed anyway.
> 
>
> Key: NUTCH-2634
> URL: https://issues.apache.org/jira/browse/NUTCH-2634
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.20
>
>
> In order to check if an outlink in an <a> tag can be followed, nutch checks 
> whether the value of its rel attribute is the exact string "nofollow".
> However, [the rel attribute can contain a list of link 
> types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], 
> all of which should be respected.
> So nutch rightfully doesn't follow a link like:
> {code:html}
> DO NOT FOLLOW THIS LINK
> {code}
> but wrongfully follows :
> {code:html}
> DO NOT FOLLOW THIS 
> LINK
> {code}
> Because of the code duplication in nutch's html parsers, this should be fixed 
> in two places:
> # 
> [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
> # 
> [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]
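
A minimal sketch of the token-based check the description asks for, assuming 
the rel value is treated as a whitespace-separated, case-insensitive list of 
link types. The class and method names are hypothetical and this is not the 
code in either DOMContentUtils:

```java
/** Hypothetical helper: token-based check of a rel attribute value. */
final class RelTokens {

  /** True if any whitespace-separated token of rel equals "nofollow". */
  static boolean hasNoFollow(String rel) {
    if (rel == null) {
      return false;
    }
    for (String token : rel.trim().split("\\s+")) {
      // Link types are compared case-insensitively.
      if (token.equalsIgnoreCase("nofollow")) {
        return true;
      }
    }
    return false;
  }
}
```

With this check a rel value that lists several link types is handled the same 
way as a plain "nofollow", while values that merely contain the substring (for 
example a made-up "nofollowing") are not treated as nofollow.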





[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-22 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651243#comment-17651243
 ] 

Markus Jelsma edited comment on NUTCH-2978 at 12/22/22 11:33 AM:
-

Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, 
although it does work. We now get:

java.util.ServiceConfigurationError: 
org.apache.logging.log4j.spi.Provider: 
org.apache.logging.log4j.core.impl.Log4jProvider not a subtype

There are no duplicate versions of the same logging JARs anywhere on the 
classpath.
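
One way to double-check the claim about duplicates is to ask a class loader 
for every location from which it can see a given class. The following is only 
a sketch with a hypothetical class name (ClasspathCopies). It inspects a 
single class loader, so it would have to be run once per plugin class loader 
to say anything about the plugins:

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic: list every location a class file is visible from. */
final class ClasspathCopies {

  /** Returns the URLs of all copies of className visible to the given loader. */
  static List<URL> locate(String className, ClassLoader loader) throws IOException {
    // Class files are looked up as resources, e.g. "org/example/Foo.class".
    String resource = className.replace('.', '/') + ".class";
    List<URL> found = new ArrayList<>();
    Enumeration<URL> urls = loader.getResources(resource);
    while (urls.hasMoreElements()) {
      found.add(urls.nextElement());
    }
    return found;
  }
}
```

Calling locate("org.apache.logging.log4j.spi.Provider", loader) and getting 
more than one URL back would point to a second copy of the log4j API being 
visible to that loader.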


was (Author: markus17):
Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, 
although it does work. We now get:

java.util.ServiceConfigurationError: 
org.apache.logging.log4j.spi.Provider: 
org.apache.logging.log4j.core.impl.Log4jProvider not a subtype

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I ran into trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or, with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-22 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651243#comment-17651243
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Ah nope, this is not it. Parse-tika throws lots of errors and stack traces, 
although it does work. We now get:

java.util.ServiceConfigurationError: 
org.apache.logging.log4j.spi.Provider: 
org.apache.logging.log4j.core.impl.Log4jProvider not a subtype

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I ran into trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or, with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648636#comment-17648636
 ] 

Markus Jelsma commented on NUTCH-2978:
--

The new patch makes sure there is a log4j 2.19 in the Tika plugin and that it 
is mentioned in its plugin.xml; otherwise the errors above will happen. Now I 
am not sure the other plugins are still OK.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978-3.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-3.patch, NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648633#comment-17648633
 ] 

Markus Jelsma commented on NUTCH-2978:
--

OK, I also wanted to get rid of the loose log4j libs. There was still one in 
any23 and one in parse-tika. When the lib is removed from parse-tika, lots of 
bad things happen.
{code:java}
22/12/16 13:36:03 WARN ooxml.OPCPackageDetector: Unable to load org.apache.tika.detect.microsoft.ooxml.OPCPackageDetector
java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
        at org.apache.poi.ooxml.POIXMLRelation.(POIXMLRelation.java:54)
        at org.apache.tika.detect.microsoft.ooxml.OPCPackageDetector.(OPCPackageDetector.java:106)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at java.base/java.lang.Class.newInstance(Class.java:584)
        at org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:312)
        at org.apache.tika.detect.zip.DefaultZipContainerDetector.(DefaultZipContainerDetector.java:85)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:78)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
        at org.apache.tika.detect.DefaultDetector.getDefaultDetectors(DefaultDetector.java:90)
        at org.apache.tika.detect.DefaultDetector.(DefaultDetector.java:50)
        at org.apache.tika.detect.DefaultDetector.(DefaultDetector.java:55)
        at org.apache.tika.config.TikaConfig.getDefaultDetector(TikaConfig.java:264)
        at org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:1017)
        at org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:975)
        at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:630)
        at org.apache.tika.config.TikaConfig.(TikaConfig.java:155)
        at org.apache.tika.config.TikaConfig.(TikaConfig.java:145)
        at org.apache.tika.config.TikaConfig.(TikaConfig.java:120)
        at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:276)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:177)
        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:75)
        at org.apache.nutch.indexer.IndexingFiltersChecker.process(IndexingFiltersChecker.java:245)
        at org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:87)
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:136)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:316)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at org.apache.nutch.plugin.PluginClassLoader.loadClassFromSystem(PluginClassLoader.java:105)
        at org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:93

[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648625#comment-17648625
 ] 

Markus Jelsma commented on NUTCH-2978:
--

The patch now includes Sebastian's patch and actually contains the upgrade from 
the old slf4j to the new 2.0.6. Tested on a Hadoop 3.3.4 cluster with a parsing 
fetcher; this went just fine.

-I must admit that the slf4j and jcl-over-slf4j jars remaining in the plugins 
do bother me to some degree.-

New patch now includes exclusions to get rid of all of them.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978-2.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-2.patch, 
> NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: (was: NUTCH-2978-1.patch)

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, 
> NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978-1.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, 
> NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-16 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978-1.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-1.patch, NUTCH-2978-any23.patch, 
> NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-15 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648060#comment-17648060
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Ah yes, thanks! I am not sure whether a 'solution' will come from Tika; that 
specific package seems to be shaded in all versions between 2.3.0 and 2.6.0. 
But we, ASF Nutch, do not depend on it, so we are good.

Patched like this, Nutch will fetch/parse just fine when running on Hadoop. I 
did get this when running an indexchecker using the job file:

ERROR StatusLogger Log4j2 could not find a logging 
implementation. Please add log4j-core to the classpath. Using SimpleLogger to 
log to the console...

However, logging worked just fine.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-13 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646681#comment-17646681
 ] 

Markus Jelsma commented on NUTCH-2978:
--

About the slf4j issues:

Somewhere another slf4j jar was lurking in the job file, but I couldn't find it 
for a long while, until I saw there was an slf4j jar packaged within the 
tika-parser-scientific-package! I got rid of it, then got a xerces/xml-apis 
error, so I excluded that as well. Now there are many other errors.
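
A jar hidden inside another jar is easy to miss when listing the job file by hand. As a rough sketch (not something used on this issue), the search can be automated by walking the archive and descending into every nested .jar entry. The class name NestedJarScanner and its arguments are hypothetical; only java.util.zip and java.nio.file from the JDK are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Sketch: find entries in an archive, including those inside nested jars. */
public class NestedJarScanner {

    /** Returns paths of all entries whose name contains the needle. */
    public static List<String> find(Path archive, String needle) throws IOException {
        List<String> hits = new ArrayList<>();
        try (InputStream in = Files.newInputStream(archive)) {
            scan(new ZipInputStream(in), archive.getFileName().toString(), needle, hits);
        }
        return hits;
    }

    private static void scan(ZipInputStream zip, String prefix, String needle,
            List<String> hits) throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String path = prefix + "!/" + entry.getName();
            if (entry.getName().contains(needle)) {
                hits.add(path);
            }
            if (!entry.isDirectory() && entry.getName().endsWith(".jar")) {
                // Read the nested jar straight from the current entry's bytes.
                // The inner stream is not closed: it shares the outer stream.
                scan(new ZipInputStream(zip), path, needle, hits);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: archive to scan, args[1]: text to look for in entry names
        for (String hit : find(Paths.get(args[0]), args[1])) {
            System.out.println(hit);
        }
    }
}
```

Called with the job file and a needle such as org/slf4j/Logger.class, it prints one line per copy of that class, with the chain of enclosing jars separated by "!/", so a copy bundled inside another package becomes visible.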

Something to look out for when upgrading Tika. For some reason, although we 
are using the same Tika version, that specific package does not appear as a 
dependency of Tika in vanilla Nutch. That may change later.

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978-any23.patch, NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Resolved] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-12 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2924.
--
Resolution: Fixed

To https://gitbox.apache.org/repos/asf/nutch.git
  d806aa450..7d3900450  master -> master

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>    Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be evaluated once per host.
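
To illustrate the difference described above, here is a small self-contained sketch. It is not the Nutch generator code: the HostStats record, the method names and the lambda standing in for the generate.maxCount expression are all hypothetical. It only contrasts evaluating an expression once and reusing the result with evaluating it again for every host.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Sketch: evaluating a limit expression once versus once per host. */
public class PerHostMaxCount {

    /** Hypothetical per-host record, standing in for data a hostdb would supply. */
    public static final class HostStats {
        public final String host;
        public final int fetched;

        public HostStats(String host, int fetched) {
            this.host = host;
            this.fetched = fetched;
        }
    }

    /** Buggy shape: the expression runs for the first host and is reused for all. */
    public static Map<String, Integer> evaluatedOnce(List<HostStats> hosts,
            ToIntFunction<HostStats> expr) {
        Map<String, Integer> limits = new LinkedHashMap<>();
        Integer cached = null;
        for (HostStats stats : hosts) {
            if (cached == null) {
                cached = expr.applyAsInt(stats); // runs only for the first host
            }
            limits.put(stats.host, cached);
        }
        return limits;
    }

    /** Intended shape: the expression is re-evaluated with each host's own data. */
    public static Map<String, Integer> evaluatedPerHost(List<HostStats> hosts,
            ToIntFunction<HostStats> expr) {
        Map<String, Integer> limits = new LinkedHashMap<>();
        for (HostStats stats : hosts) {
            limits.put(stats.host, expr.applyAsInt(stats));
        }
        return limits;
    }

    public static void main(String[] args) {
        List<HostStats> hosts = Arrays.asList(
            new HostStats("small.example", 10),
            new HostStats("big.example", 1000));
        // Stand-in for a configured expression: larger hosts get a larger limit.
        ToIntFunction<HostStats> expr = s -> s.fetched > 100 ? 50 : 5;
        System.out.println(evaluatedOnce(hosts, expr));
        System.out.println(evaluatedPerHost(hosts, expr));
    }
}
```

With the first method every host inherits the limit computed for whichever host happened to come first, which is the symptom the issue title describes.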





[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644838#comment-17644838
 ] 

Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:34 PM:
---

Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just 
this patch, and the generator patch. It works!

Not sure why our parser stuff fails, but at least Nutch's stuff is working! 
Both use a LoggerFactory.getLogger invocation, yet the original TikaParser 
invocation works and mine doesn't.


was (Author: markus17):
Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just 
this patch, and the generator patch. It works!

Not sure why our parser stuff fails, but at least Nutch's stuff is working!

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644838#comment-17644838
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Ah, well. I also tried a Tika parsing fetcher of a vanilla 1.20 Nutch with just 
this patch, and the generator patch. It works!

Not sure why our parser stuff fails, but at least Nutch's stuff is working!

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Comment Edited] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644825#comment-17644825
 ] 

Markus Jelsma edited comment on NUTCH-2978 at 12/8/22 2:12 PM:
---

This morning I saw one of our internal projects spewing the same error as 
any23; it was quickly remedied by upgrading a dependency further down the line. 
Not sure if this will go as easily with the any23 plugin, I'll take a look.

Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 
3.3.4 cluster, and it ran flawlessly! Encouraged by the result I quickly ran a 
generate, followed by a fetch. The fetch failed due to a LinkageError in our 
parser plugin, similar to parse-tika. Too bad.

A local indexchecker runs fine, an indexchecker using a job file fails with the 
same error.

Removing all reload4j references is not solving it, as expected. Not sure what 
to do now.


was (Author: markus17):
This morning I saw one of our internal projects spewing the same error as 
any23; it was quickly remedied by upgrading a dependency further down the line. 
Not sure if this will go as easily with the any23 plugin, I'll take a look.

Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 
3.3.4 cluster, and it ran flawlessly! Encouraged by the result I quickly ran a 
generate, followed by a fetch. The fetch failed due to a LinkageError in our 
parser plugin, similar to parse-tika. Too bad.

A local indexchecker runs fine, an indexchecker using a job file fails with the 
same error.

 

 

 

 

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644825#comment-17644825
 ] 

Markus Jelsma commented on NUTCH-2978:
--

This morning I saw one of our internal projects spewing the same error as 
any23; it was quickly remedied by upgrading a dependency further down the line. 
Not sure if this will go as easily with the any23 plugin, I'll take a look.

Regarding running on Hadoop, I just ran a patched 1.20 CrawldbReader job on a 
3.3.4 cluster, and it ran flawlessly! Encouraged by the result I quickly ran a 
generate, followed by a fetch. The fetch failed due to a LinkageError in our 
parser plugin, similar to parse-tika. Too bad.

A local indexchecker runs fine, an indexchecker using a job file fails with the 
same error.

 

 

 

 

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644808#comment-17644808
 ] 

Markus Jelsma commented on NUTCH-2924:
--

Here's the proper patch, finally.

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be evaluated once per host.





[jira] [Updated] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-08 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2924:
-
Attachment: NUTCH-2924-5.patch

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924-5.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be evaluated once per host.





[jira] [Commented] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644491#comment-17644491
 ] 

Markus Jelsma commented on NUTCH-2978:
--

Yes, I saw the slf4j present in the plugin; it already troubled me when I 
attempted an upgrade to a newer Tika version.

Regarding reload4j, I was already worried it might not run in distributed mode 
but haven't tested it yet. For now I am glad enough that Nutch runs our 
Tika-based parser in local mode.

To be continued

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Commented] (NUTCH-2924) Generate maxCount expr evaluated only once

2022-12-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644489#comment-17644489
 ] 

Markus Jelsma commented on NUTCH-2924:
--

Yes, that is expected. This patch requires a hostdb to be configured and 
present; I will add a check for that.

> Generate maxCount expr evaluated only once
> --
>
> Key: NUTCH-2924
> URL: https://issues.apache.org/jira/browse/NUTCH-2924
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.16
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2924-1.patch, NUTCH-2924-2.patch, 
> NUTCH-2924-3.patch, NUTCH-2924-4.patch, NUTCH-2924.patch
>
>
> The generate.maxCount expression is evaluated only once in the generator's 
> reducer; instead, it must be evaluated once per host.





[jira] [Resolved] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2977.
--
Fix Version/s: 1.20
   Resolution: Fixed

> Support for showing dependency tree
> ---
>
> Key: NUTCH-2977
> URL: https://issues.apache.org/jira/browse/NUTCH-2977
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2977.patch
>
>
> I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
> especially reload4j. I desperately need this function for that.
>  
> $ ant dependencytree
>  
> will now show the tree.





[jira] [Commented] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644436#comment-17644436
 ] 

Markus Jelsma commented on NUTCH-2977:
--

Committed:
  ed7b6615b..d806aa450  master -> master

> Support for showing dependency tree
> ---
>
> Key: NUTCH-2977
> URL: https://issues.apache.org/jira/browse/NUTCH-2977
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2977.patch
>
>
> I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
> especially reload4j. I desperately need this function for that.
>  
> $ ant dependencytree
>  
> will now show the tree.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Description: 
I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
log4j -> reload4j.

 

This patch fixes it.

  was:
I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
log4j -> reload4j.

 

 


> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
> This patch fixes it.





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Description: 
I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
log4j -> reload4j.

 

 

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>
> I got in trouble upgrading some dependencies and got a lot of LinkageErrors 
> today, or with a Tika upgrade, disappearing logs. This patch fixes that by 
> moving to slf4j2, using the correct log4j-slf4j2-impl and getting rid of old 
> log4j -> reload4j.
>  
>  





[jira] [Updated] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2978:
-
Attachment: NUTCH-2978.patch

> Move to slf4j2 and remove log4j1 and reload4j
> -
>
> Key: NUTCH-2978
> URL: https://issues.apache.org/jira/browse/NUTCH-2978
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2978.patch
>
>






[jira] [Created] (NUTCH-2978) Move to slf4j2 and remove log4j1 and reload4j

2022-12-07 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2978:


 Summary: Move to slf4j2 and remove log4j1 and reload4j
 Key: NUTCH-2978
 URL: https://issues.apache.org/jira/browse/NUTCH-2978
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
 Attachments: NUTCH-2978.patch







[jira] [Updated] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2977:
-
Attachment: NUTCH-2977.patch

> Support for showing dependency tree
> ---
>
> Key: NUTCH-2977
> URL: https://issues.apache.org/jira/browse/NUTCH-2977
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2977.patch
>
>
> I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
> especially reload4j. I desperately need this function for that.





[jira] [Updated] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2977:
-
Description: 
I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
especially reload4j. I desperately need this function for that.

 

$ ant dependencytree

 

will now show the tree.

  was:I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
especially reload4j. I desperately need this function for that.


> Support for showing dependency tree
> ---
>
> Key: NUTCH-2977
> URL: https://issues.apache.org/jira/browse/NUTCH-2977
> Project: Nutch
>  Issue Type: Task
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2977.patch
>
>
> I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
> especially reload4j. I desperately need this function for that.
>  
> $ ant dependencytree
>  
> will now show the tree.





[jira] [Created] (NUTCH-2977) Support for showing dependency tree

2022-12-07 Thread Markus Jelsma (Jira)
Markus Jelsma created NUTCH-2977:


 Summary: Support for showing dependency tree
 Key: NUTCH-2977
 URL: https://issues.apache.org/jira/browse/NUTCH-2977
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma


I am upgrading Nutch to slf4j 2 and need to get rid of old 1.7 stuff, and 
especially reload4j. I desperately need this function for that.




