Inquiries on potential improvements

2023-01-20 Thread Kamil Mroczek
Hello,

I have a few improvements to Nutch and would like feedback on whether this
community thinks I should submit them to the main branch. Once my first PR is
approved I can start adding these. Some of them might not be good ideas, so I
am happy to hear that feedback as well.

1. json-indexer: indexes documents in json lines format

2. selenium: extract the html tag instead of the body tag (sample commit):
I needed this to extract the title of the page since that often lives in
the head tag. I am hesitant about this change because it could have bigger
effects.

3. Add ability to extract meta tags with "property" attribute (sample commit).

4. Allow selenium to handle gzip content (sample commit):
This is a port of the code from HTMLUnit that does the same thing. I needed
this to process RSS feeds properly.

5. Treat RSS feeds as normal webpages by adding links to next segment fetch
(sample commit)
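
For item 4, a minimal sketch of what handling gzip content could look like:
check the page bytes for the gzip magic number and decompress only when it is
present. This is not the code from the sample commit or from HTMLUnit; the
class GzipContent and its method names are hypothetical and only use the
standard java.util.zip classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration; not part of Nutch, Selenium or HTMLUnit. */
final class GzipContent {

  private GzipContent() {}

  /** Returns true if the buffer starts with the gzip magic bytes 0x1f 0x8b. */
  static boolean isGzipped(byte[] data) {
    return data != null && data.length >= 2
        && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
  }

  /** Decompresses gzipped input; any other input is returned unchanged. */
  static byte[] maybeGunzip(byte[] data) {
    if (!isGzipped(data)) {
      return data;
    }
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
      return in.readAllBytes(); // requires Java 9 or later
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Compresses bytes with gzip; only needed here to build demo input. */
  static byte[] gzip(byte[] data) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
      out.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bos.toByteArray();
  }
}
```

Sniffing the magic bytes rather than trusting a Content-Encoding header is a
design assumption here: as discussed elsewhere on this list, Selenium does not
expose the response headers.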


Kamil


[jira] [Commented] (NUTCH-2980) Upgrade Selenium Java to 4.7.2

2023-01-20 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679189#comment-17679189 ]

ASF GitHub Bot commented on NUTCH-2980:
---

KamilMroczek commented on PR #753:
URL: https://github.com/apache/nutch/pull/753#issuecomment-1398516536

   Yeah, verifying the licenses was a bit of a pain. I found some tools
to help with finding licenses for a batch of libraries, but none of them (that
I could find) supported our format.




> Upgrade Selenium Java to 4.7.2
> --
>
>                 Key: NUTCH-2980
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2980
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, protocol
>    Affects Versions: 1.19
>            Reporter: Kamil Mroczek
>            Priority: Major
>             Fix For: 1.20
>
>
> The Selenium version is quite old and had some issues processing a website.
> Once I switched to the latest version I was able to scrape that website. It
> is good to keep it up to date since we were already one major release behind.
> This upgrades Selenium Java from 3.141.59 to 4.7.2 and Selenium HTMLUnit from
> 2.35.1 to 4.7.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2981) Automatic creation of license list and notice files

2023-01-20 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2981:
--

             Summary: Automatic creation of license list and notice files
                 Key: NUTCH-2981
                 URL: https://issues.apache.org/jira/browse/NUTCH-2981
             Project: Nutch
          Issue Type: Improvement
          Components: build
    Affects Versions: 1.20
            Reporter: Sebastian Nagel
             Fix For: 1.20


Compiling the list of licenses and the notice files is a complex task, recently
done in NUTCH-2290 [PR#743|https://github.com/apache/nutch/pull/743]. This is
supported by a Python Jupyter notebook, but should be further automated so
that it can be run whenever necessary, e.g. on dependency upgrades or before
releases. See also STORM-3361, STORM-3437 and [Storm
dev-tools|https://github.com/apache/storm/tree/master/dev-tools].





Re: Upgrading Selenium

2023-01-20 Thread Markus Jelsma
> There must be a way, some how, some time.

There isn't:
https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/141

On Thu, 19 Jan 2023 at 15:23, Markus Jelsma <markus.jel...@openindex.io> wrote:

> > This makes some sense if you do not know anything about the URL.
> > - a HEAD request could do almost the same
> > - often one knows whether there are only HTML pages or also PDFs, zip files,
> >   and other stuff not suitable for Selenium. Could make the HEAD request
> >   optional.
>
> Ah crap, I forgot about that. With Selenium, it is still not possible to
> get the HTTP headers of the most recent request. And when requesting the
> page source for a non-text MIME-type URL, it will return either nothing or
> the result of the previous 'successful' call.
>
> Besides doing a HEAD request first, there is no neat way to work with
> non-text/html URLs as we can using HtmlUnit. That at least returns the
> headers and the raw binary data.
>
> There must be a way, some how, some time.
>
> Thanks,
> Markus
>
> On Thu, 19 Jan 2023 at 11:38, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>
>> Hi Kamil, hi Markus,
>>
>> upgrading the Selenium plugin is very appreciated!
>>
>>  > Besides that, the plugin also needs some overhaul.
>>
>> Definitely.
>>
>>  > It currently first downloads the URL with HttpClient, and then, depending on
>>  > MIME-type, it may or may not forward the URL to Selenium so it can be
>>  > downloaded again.
>>
>> This makes some sense if you do not know anything about the URL.
>> - a HEAD request could do almost the same
>> - often one knows whether there are only HTML pages or also PDFs, zip files,
>>   and other stuff not suitable for Selenium. Could make the HEAD request
>>   optional.
>>
>>  > merging the lib-selenium plugin with the protocol-selenium plugin
>>
>> I guess lib-selenium is to share common components between protocol-selenium
>> and protocol-interactiveselenium. Maybe merge all three? Or skip
>> interactiveselenium for now.
>>
>> ~Sebastian
>>
>> On 1/17/23 19:56, Markus Jelsma wrote:
>> > Hello Kamil,
>> >
>> > Yes, the plugin needs some upgrading indeed. We use a modern version of it
>> > elsewhere and it works really well, at least better than HtmlUnit.
>> >
>> > Besides that, the plugin also needs some overhaul. It currently first downloads
>> > the URL with HttpClient, and then, depending on MIME-type, it may or may not
>> > forward the URL to Selenium so it can be downloaded again.
>> >
>> > There is a lot of code in the plugin that should be removed. I would also opt
>> > for merging the lib-selenium plugin with the protocol-selenium plugin. There is
>> > no obvious need for having it separated.
>> >
>> > These can be, of course, separate tasks.
>> >
>> > Regards,
>> > Markus
>> >
>> > On Tue, 17 Jan 2023 at 17:49, Kamil Mroczek wrote:
>> >
>> > Hello,
>> >
>> > I am sending a message to inquire whether I should submit a patch which
>> > updates selenium to the latest version. Although it is a major version
>> > upgrade to the library, very few code changes were needed to update.
>> >
>> > For a preview of the changes I made you can look here
>> > <https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
>> > Although not used in the code anymore (it was commented out), PhantomJS
>> > support has been removed from Selenium in the latest version. The commit
>> > also removes Opera since it was commented out but I can leave that in if
>> > needed. The build and tests pass. I have been using the Chrome driver
>> > successfully with it and would just need to run a quick test with Firefox
>> > to make sure it works too.
>> >
>> > I have only been using Nutch for about a month but have spent quite a bit of
>> > time looking over different parts of the code to understand how to configure
>> > it and change it.
>> >
>> > Kamil
>> >
>>
>
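
The "HEAD request first" idea from this thread could be sketched roughly as
follows: ask for the headers only, and hand the URL to Selenium only when the
Content-Type looks like HTML. This is an illustration under assumptions, not
the plugin's code; the class HeadProbe, its method names and the list of
accepted MIME types are hypothetical, and it uses the java.net.http client
from Java 11 rather than the HttpClient the plugin currently uses.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Nutch or Selenium. */
final class HeadProbe {

  private HeadProbe() {}

  /** True if the Content-Type value names an HTML-like document (assumed list). */
  static boolean isHtmlLike(String contentType) {
    if (contentType == null) {
      return false;
    }
    // Drop parameters such as "; charset=UTF-8" before comparing.
    String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    return mime.equals("text/html") || mime.equals("application/xhtml+xml");
  }

  /** Sends a HEAD request and returns the Content-Type header, if any. */
  static Optional<String> headContentType(HttpClient client, URI uri)
      throws IOException, InterruptedException {
    HttpRequest request = HttpRequest.newBuilder(uri)
        .method("HEAD", HttpRequest.BodyPublishers.noBody())
        .build();
    HttpResponse<Void> response =
        client.send(request, HttpResponse.BodyHandlers.discarding());
    return response.headers().firstValue("Content-Type");
  }

  /** Decides whether the URL is worth forwarding to Selenium. */
  static boolean shouldUseSelenium(HttpClient client, URI uri)
      throws IOException, InterruptedException {
    return headContentType(client, uri).map(HeadProbe::isHtmlLike).orElse(false);
  }
}
```

A caller would pass HttpClient.newHttpClient() and the URL to
shouldUseSelenium. A missing Content-Type header is treated as "not HTML"
here, which is a conservative assumption; making the probe optional, as
suggested above, would be a configuration decision on top of this.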