Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Tim Allison
Thank you, all!  I’m thrilled to join the team!

On Thu, Jul 20, 2023 at 9:42 AM Julien Nioche 
wrote:

> What a fantastic addition to the Nutch team! Congrats to Tim
>
> On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel  wrote:
>
>> Dear all,
>>
>> It is my pleasure to announce that Tim Allison has joined us
>> as a committer and member of the Nutch PMC.
>>
>> You may already know Tim as a maintainer of and contributor to
>> Apache Tika. So, it was great to see contributions to the
>> Nutch source code from an experienced developer who is also
>> active in a related Apache project. Among other contributions
>> Tim recently implemented the indexer-opensearch plugin.
>>
>> Thank you, Tim Allison, and congratulations on your new role
>> in the Apache Nutch community! And welcome on board!
>>
>> Sebastian
>> (on behalf of the Nutch PMC)
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble 
>


Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Julien Nioche
What a fantastic addition to the Nutch team! Congrats to Tim

On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel  wrote:

> Dear all,
>
> It is my pleasure to announce that Tim Allison has joined us
> as a committer and member of the Nutch PMC.
>
> You may already know Tim as a maintainer of and contributor to
> Apache Tika. So, it was great to see contributions to the
> Nutch source code from an experienced developer who is also
> active in a related Apache project. Among other contributions
> Tim recently implemented the indexer-opensearch plugin.
>
> Thank you, Tim Allison, and congratulations on your new role
> in the Apache Nutch community! And welcome on board!
>
> Sebastian
> (on behalf of the Nutch PMC)
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-07-20 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17745062#comment-17745062 ]

ASF GitHub Bot commented on NUTCH-2993:
---

sebastian-nagel opened a new pull request, #764:
URL: https://github.com/apache/nutch/pull/764

   - apply patch contributed by Markus Jelsma




> ScoringDepth plugin to skip depth check based on URL Pattern
> ------------------------------------------------------------
>
>         Key: NUTCH-2993
>         URL: https://issues.apache.org/jira/browse/NUTCH-2993
>     Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>    Priority: Minor
>     Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to stay focused 
> on a narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.
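
The behaviour described in the issue can be illustrated with a minimal Java sketch. It is not the code from the attached patches or from the scoring-depth plugin: the class name, the method name and its parameters are hypothetical, and the use of find() rather than a full match is an assumption.

```java
import java.util.regex.Pattern;

/**
 * Illustrative sketch only, not the NUTCH-2993 patch.
 * All names here are hypothetical.
 */
public class DepthOverrideSketch {

    /**
     * Returns the max depth to hand down to the outlinks of fromUrl:
     * the override value if fromUrl matches the configured pattern,
     * otherwise the default max depth.
     */
    static int maxDepthForOutlinks(String fromUrl, Pattern pattern,
                                   int overrideMaxDepth, int defaultMaxDepth) {
        // Assumption: a partial match (find) is enough to trigger the override.
        if (pattern != null && pattern.matcher(fromUrl).find()) {
            return overrideMaxDepth;
        }
        // No pattern configured or no match: fall back to the default.
        return defaultMaxDepth;
    }

    public static void main(String[] args) {
        // Hypothetical pattern and depth values, chosen for illustration.
        Pattern focus = Pattern.compile("^https?://example\\.org/docs/");
        System.out.println(maxDepthForOutlinks(
            "https://example.org/docs/a.html", focus, 10, 3));
        System.out.println(maxDepthForOutlinks(
            "https://example.org/blog/b.html", focus, 10, 3));
    }
}
```

With these example values, outlinks of the /docs/ page would be allowed the larger depth, while outlinks of any other page fall back to the default.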



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-07-20 Thread Sebastian Nagel (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17745061#comment-17745061 ]

Sebastian Nagel commented on NUTCH-2993:


Hi [~markus17], the updated patch looks good. I'll open a PR for easier 
merging. Alternatively, feel free to apply the patch to master and commit it!

> Now the maxDepth resets if it does NOT match/find the pattern.

It sounds plausible to reset the increased depth back to the default value if 
the URL isn't matched.

Generally, I agree that steering a crawler and keeping it focused is always 
difficult. We should also update the scoring-similarity plugin; I'm not sure 
whether it still works with a recent Hadoop combined with a quite old Mahout 
version. Interestingly, 
[SimilarityScoringFilter|https://cwiki.apache.org/confluence/display/NUTCH/SimilarityScoringFilter]
 mentions the filtering of URLs/outlinks as a nice-to-have option.

> ScoringDepth plugin to skip depth check based on URL Pattern
> ------------------------------------------------------------
>
>         Key: NUTCH-2993
>         URL: https://issues.apache.org/jira/browse/NUTCH-2993
>     Project: Nutch
>  Issue Type: Improvement
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>    Priority: Minor
>     Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to stay focused 
> on a narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Sebastian Nagel

Dear all,

It is my pleasure to announce that Tim Allison has joined us
as a committer and member of the Nutch PMC.

You may already know Tim as a maintainer of and contributor to
Apache Tika. So, it was great to see contributions to the
Nutch source code from an experienced developer who is also
active in a related Apache project. Among other contributions
Tim recently implemented the indexer-opensearch plugin.

Thank you, Tim Allison, and congratulations on your new role
in the Apache Nutch community! And welcome on board!

Sebastian
(on behalf of the Nutch PMC)