Re: GeoIP Plugin - Domain Field Not Indexed

2024-08-02 Thread Sebastian Nagel
Hi James, thanks for the update! Would you mind sharing your solution? Just thinking about the next user searching for the same problem... Otherwise: I never indexed the GeoIP domain. And yes, the index-geoip plugin isn't easy to configure, see https://nutch.apache.org/documentation/javadoc/

Re: Protocol-http not storing response headers

2024-07-31 Thread Sebastian Nagel
Hi Markus, >> And i do not agree with it. Almost all content is compressed now, so this >> will never work. We need the headers and response code stored for WARC >> export and do not care about an incorrect length header. No, don't do this. You need to rewrite the header. There are many WARC rea

Re: Help posting question

2024-04-25 Thread Sebastian Nagel
Hi Sheham, the nutch-site.xml configures mapreduce.task.timeout = 1800. 1.8 seconds (1800 milliseconds) is very short. The default is 600 seconds or 10 minutes, see [1]. Since Nutch needs to finish fetching before the task timeout applies, threads fetching not quickly enough and st
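
The unit mix-up above is easy to check by hand. A minimal sketch, assuming only that mapreduce.task.timeout is given in milliseconds as stated in the reply; the helper name ms_to_seconds is made up for illustration and is not part of Nutch or Hadoop:

```shell
#!/bin/sh
# Hypothetical helper: convert a timeout in milliseconds to seconds,
# truncated to one decimal place, using integer arithmetic only.
ms_to_seconds() {
  printf '%d.%d\n' "$(( $1 / 1000 ))" "$(( ($1 % 1000) / 100 ))"
}

configured_ms=1800   # value quoted from the nutch-site.xml in question
default_ms=600000    # default mentioned in the reply: 600 seconds

echo "configured: $(ms_to_seconds "$configured_ms") s"
echo "default:    $(ms_to_seconds "$default_ms") s"
```

To get back to a ten-minute timeout, the property would need the value 600000, not 600.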

Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Sebastian Nagel
, see https://github.com/sebastian-nagel/nutch-test-single-node-cluster/ One note about the CHANGES.md: it's now a mixture of HTML and plain text. It does not use the potential of markdown, e.g. sections / headlines for the releases to make the change log navigable via a table of contents. Th

Re: truncation, parsing and indexing?

2023-10-23 Thread Sebastian Nagel
Hi Tim, >> I'm using the okhttp protocol, because I don't think the http protocol >> stores truncation information. protocol-http could mark truncations as well, however. Please, also open an issue for this and other protocol plugins. >> Should I open a ticket to have ParseSegment also check

Re: Exclude HTML elements from Crawl

2023-09-22 Thread Sebastian Nagel
Hi Michael, > I wonder if there is not already a build-in option to exclude HTML > elements (like a div with a given id or class or other elements like header). No, there isn't one so far. > I know https://issues.apache.org/jira/browse/NUTCH-585 > I also do not understand why this little patc

Re: Change log file directory

2023-08-07 Thread Sebastian Nagel
Hi, yes, this is possible by pointing the environment variable NUTCH_LOG_DIR to a different folder. The default is: $NUTCH_HOME/logs/ See also the script bin/nutch which is called by bin/crawl: https://github.com/apache/nutch/blob/master/src/bin/nutch#L30 (it's also possible to change the log f
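
A minimal sketch of the environment-variable approach, assuming a POSIX shell; the directory /tmp/nutch-logs is only a placeholder and any writable folder would do:

```shell
#!/bin/sh
# Assumption: /tmp/nutch-logs is a hypothetical target directory.
# Keep a value that is already set, otherwise fall back to the placeholder.
NUTCH_LOG_DIR="${NUTCH_LOG_DIR:-/tmp/nutch-logs}"
export NUTCH_LOG_DIR

# Create the folder up front so a missing directory cannot get in the way.
mkdir -p "$NUTCH_LOG_DIR"

echo "Nutch logs will be written to: $NUTCH_LOG_DIR"
# Commands started from this shell afterwards (bin/nutch, bin/crawl)
# inherit the exported variable.
```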

Re: Maximum header limit (1000) exceeded

2023-07-26 Thread Sebastian Nagel
e Cohen On Wed, Jul 26, 2023 at 10:36 AM Sebastian Nagel wrote: Hi Steve, > file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67 what does the file contain? An .eml file (following RFC822)? Would it be possible to share this file or at l

Re: Maximum header limit (1000) exceeded

2023-07-26 Thread Sebastian Nagel
Hi Steve, > file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67 what does the file contain? An .eml file (following RFC822)? Would it be possible to share this file or at least a chunk large enough to reproduce the issue? The error message

[ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Sebastian Nagel
Dear all, It is my pleasure to announce that Tim Allison has joined us as a committer and member of the Nutch PMC. You may already know Tim as a maintainer of and contributor to Apache Tika. So, it was great to see contributions to the Nutch source code from an experienced developer who is also

Re: Nutch 1.19 Getting Error: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'

2023-05-15 Thread Sebastian Nagel
Hi Eric, unfortunately, on Windows you also need to download and install winutils.exe and hadoop.dll, see https://github.com/cdarlint/winutils and https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io The installation of Ha

Re: Merging CrawlDBs

2023-02-02 Thread Sebastian Nagel
Hi Kamil, > I was wondering if this script is advisable to use? I haven't tried the script itself but some of the underlying commands - mergedb, etc. > merge command ($nutch_dir/nutch merge $index_dir $new_indexes) Of course, some of the commands are obsolete. Long time ago, Nutch used Lucene

Re: Unsubscribe from Users list

2023-01-25 Thread Sebastian Nagel
Hi, please send a mail to user-unsubscr...@nutch.apache.org See https://nutch.apache.org/community/mailing-lists/ Thanks! Best, Sebastian On 1/25/23 14:53, Steven Zhu wrote: Please unsubscribe me from the users list. Steven On Tue, Jan 24, 2023 at 10:27 PM Ankit gupta wrote: Hell

Re: "Unparseable date" build issue with ANT on AWS EMR

2023-01-17 Thread Sebastian Nagel
owse/NUTCH-2974 Just in case you want to try it. ~Sebastian On 11/21/22 10:36, Sebastian Nagel wrote: Hi Kamil, thanks for trying and finding a solution! I've opened a JIRA issue to track the problem: https://issues.apache.org/jira/browse/NUTCH-2974 Thanks! Sebastian On 11/19/22 18:37

Re: Configuration Nutch in cluster mode

2023-01-17 Thread Sebastian Nagel
Hadoop cluster. All commands are the same as in fully distributed mode. If it helps, I prepared some setup scripts to run Nutch in pseudo-distributed mode: https://github.com/sebastian-nagel/nutch-test-single-node-cluster Best, Sebastian On 1/15/23 04:26, Mike wrote: I will now try to confi

Re: Nutch/Hadoop Cluster

2023-01-17 Thread Sebastian Nagel
Hi Mike, > It can be tedious to set up for the first time, and there are many components. In case you prefer Linux packages, I can recommend Apache Bigtop, see https://bigtop.apache.org/ and for the list of package repositories https://downloads.apache.org/bigtop/stable/repos/ ~Sebastian

Re: CSV indexer file data overwriting

2022-11-24 Thread Sebastian Nagel
Hi Paul, > the indexer was writing the > documents info in the file (nutch.csv) twice, Yes, I see. And now I know what I've overlooked: .../bin/nutch index -Dmapreduce.job.reduces=2 You need to run the CSV indexer with only a single reducer. In order to do so, please pass the option --num-tas

Re: CSV indexer file data overwriting

2022-11-23 Thread Sebastian Nagel
Hi Paul, as far as I can see the indexer is run only once and now indexes 26 documents: org.apache.nutch.indexer.IndexingJob 2022-11-22 06:32:57,164 INFO o.a.n.i.IndexingJob [main] Indexer: 26 indexed (add/update) The logs also indicate that both segments are indexed at once: org.apache.nu

Re: "Unparseable date" build issue with ANT on AWS EMR

2022-11-21 Thread Sebastian Nagel
Hi Kamil, thanks for trying and finding a solution! I've opened a JIRA issue to track the problem: https://issues.apache.org/jira/browse/NUTCH-2974 Thanks! Sebastian On 11/19/22 18:37, Kamil Mroczek wrote: I've been able to work around this issue by adding "pattern" to touch tag on line 101 i

[DISCUSS] Bug reporting - enabling Github issues?

2022-11-21 Thread Sebastian Nagel
Hi everybody, because of a growing number of spam account creations, public sign-ups to the Apache JIRA have been disabled. In order to allow users to report bugs, we have two options: 1 either users let us know about the issue on the mailing list and one of the Nutch PMC creates a user account

Re: CSV indexer file data overwriting

2022-11-21 Thread Sebastian Nagel
Hi Paul, yes, the CSV indexer removes the CSV output before it starts a new one. The problem here is that the indexer is run twice in a loop. Possible work-arounds - assumed you're using the script bin/crawl: 1 after each indexing command in the loop, move the CSV output so that it gets not d

Re: Incomplete TLD List

2022-11-08 Thread Sebastian Nagel
Hi Mike, hi Markus, there's also https://issues.apache.org/jira/browse/NUTCH-1806 which would make it much easier to keep up-to-date with the public suffix list. Resp., because crawler-commons loads the public suffix list (for historic reasons named "effective_tld_names.dat") from the class pa

[ANNOUNCE] Apache Nutch 1.19 Release

2022-09-08 Thread Sebastian Nagel
The Apache Nutch team is pleased to announce the release of Apache Nutch v1.19. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained configuration, relying on Apache Hadoop™ data structures. Source and binary distributions are available for download from the Apach

[RESULT] was [VOTE] Release Apache Nutch 1.19 RC#1

2022-09-06 Thread Sebastian Nagel
Hi Folks, thanks to everyone who was able to review the release candidate! 72 hours have definitely passed, please see below for vote results. [4] +1 Release this package as Apache Nutch 1.19 Markus Jelsma * BlackIce * Jorge Betancourt * Sebastian Nagel * [0] -1 Do not release this

Re: Nutch 1.19 schema.xml

2022-09-04 Thread Sebastian Nagel
> > Thanks > Mike > > On Fri, 2 Sept 2022 at 13:25, Sebastian Nagel > wrote: > >> Hi Mike, >> >> the Nutch/Solr schema.xml will be updated with the release of 1.19 >> (expected >> soon, a vote about RC#1 is ongoing): >> [NUTCH-

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-09-02 Thread Sebastian Nagel
es > file in the cache. > > Since Ralf can compile it without problems, it seems to be an issue on my > machine only. So Nutch seems fine, therefore +1. > > Regards, > Markus > > [1] > https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/ > >

Re: Nutch 1.19 schema.xml

2022-09-02 Thread Sebastian Nagel
Hi Mike, the Nutch/Solr schema.xml will be updated with the release of 1.19 (expected soon, a vote about RC#1 is ongoing): [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType [NUTCH-2957] - add fall-back field definitions for unknown index fields [NUTCH-2956] - typos in field n

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-28 Thread Sebastian Nagel
pl/StaticLoggerBinder.class] >>>> >>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an >>>> explanation. >>>> SLF4J: Actual binding is of type >>>> [org.apache.logging.slf4j.Log4jLoggerFactory] >>>>

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-28 Thread Sebastian Nagel
nitialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for > more info. > > I am worried about the indexer-elastic plugin, maybe others have that > problem too? Otherwise everything seems fine. > > Markus > > Op ma

[VOTE] Release Apache Nutch 1.19 RC#1

2022-08-22 Thread Sebastian Nagel
://github.com/sebastian-nagel/nutch-test-single-node-cluster/)

Re: [DISCUSS] Release 1.19 ?

2022-08-10 Thread Sebastian Nagel
Jelsma wrote: > Sounds good! > > I see we're still at Tika 2.3.0, I'll submit a patch to upgrade to the > current 2.4.1. > > Thanks! > Markus > > On Tue, 9 Aug 2022 at 09:11, Sebastian Nagel wrote: > >> Hi all, >> >> more than 60 issues

[DISCUSS] Release 1.19 ?

2022-08-09 Thread Sebastian Nagel
Hi all, more than 60 issues are done for Nutch 1.19 https://issues.apache.org/jira/projects/NUTCH/versions/12349580 including - important dependency upgrades - Hadoop 3.3.3 - Any23 2.7 - Tika 2.3.0 - plugin-specific URL stream handlers (NUTCH-2429) - migration - from Java/JDK 8

Re: Unable to create core Caused by: solr.LatLonType

2022-08-06 Thread Sebastian Nagel
Fyi, the issue is tracked on https://issues.apache.org/jira/browse/NUTCH-2955 ~Sebastian On 7/14/22 12:54, Sebastian Nagel wrote: > Hi Mike, > > if you do not use the plugin index-geoip, you could simply delete the line > > subFieldSuffix="_coordinate&

Re: Question about Nutch plugins

2022-07-24 Thread Sebastian Nagel
Hi Rastko, the description isn't really correct now as NUTCH_HOME is supposed to point to the runtime - if the binary package is used: this is the base folder of the package, eg. apache-nutch-1.18/ - if Nutch is built from the source, you usually point NUTCH_HOME to runtime/local/ - the dire

Re: Problem with Nutch <-> Eclipse

2022-07-18 Thread Sebastian Nagel
Hi Bob, could you share which instructions and when the error happens - during import, project build, running/debugging? The usual way is 1. to write the Eclipse project configuration, run ant eclipse 2. import the written project configuration into Eclipse Building or running/debugging N

Re: Unable to create core Caused by: solr.LatLonType

2022-07-14 Thread Sebastian Nagel
Hi Mike, if you do not use the plugin index-geoip, you could simply delete the line Otherwise, after the deprecation and the removal of the LatLonType class [1], it should be: But I haven't verified whether indexing with index-geoip enabled and the retrieval works. In any case, please

Re: Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-13 Thread Sebastian Nagel
Hi Michael, Nutch (1.18, and trunk/master) should work together with more recent Hadoop versions. At Common Crawl we use a modified Nutch version based on the recent trunk running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop cluster with x64 and arm64 AWS EC2 instances. But I

Re: FW: After update from 1.11 to 1.13 form login does not work

2022-05-10 Thread Sebastian Nagel
Hi Michael, the only differences in the protocol-httpclient plugin between Nutch 1.11 and 1.13 are - NUTCH-2280 [1] which allows configuring the cookie policy - NUTCH-2355 [2] which allows setting an explicit cookie for a request URL Could this be related? Are there any useful hints what could b

Re: Nutch not crawling all URLs

2022-01-13 Thread Sebastian Nagel
indexed data to MongoDB for further > processing. > > Kind regards, > Roseline > > > > > > -Original Message- > From: Sebastian Nagel > Sent: 12 January 2022 16:12 > To: user@nutch.apache.org > Subject: Re: Nutch not crawling all URLs >

Re: Nutch not crawling all URLs

2022-01-12 Thread Sebastian Nagel
nonnegative (>=0), content longer > than it will be truncated; otherwise, no truncation at all. Do not > confuse this setting with the file.content.limit setting. > > > > db.ignore.external.links.mode > byHost > > > db.injector.overwrite > true &g

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Ayhan, you mean? https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt Sebastian On 12/13/21 20:59, Ayhan Koyun wrote: > Hi, > > as I wrote before, it seems that I am not the only one who can not crawl all > the seed.txt url's. I could

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Roseline, > 5,36405,0,http://www.notco.com What is the status for https://notco.com/ which is the final redirect target? Is the target page indexed? ~Sebastian

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
> Dr Roseline Antai > Research Fellow > Hunter Centre for Entrepreneurship > Strathclyde Business School > University of Strathclyde, Glasgow, UK > > > The University of Strathclyde is a charitable body, registered in Scotland, > number SC015263. > > > -

Re: Error When Connecting Elasticsearch with HTTPS Connection

2021-11-18 Thread Sebastian Nagel
Hi Shi Wei, fyi: a fix for NUTCH-2903 is ready https://github.com/apache/nutch/pull/703 Sebastian On 11/16/21 13:54, Sebastian Nagel wrote: > Hi Shi Wei, > > looks like you're the first trying to connect to ES from Nutch over > HTTPS. HTTP is used as default scheme and t

Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-11-18 Thread Sebastian Nagel
The issue is now tracked in https://issues.apache.org/jira/browse/NUTCH-2907 On 10/28/21 15:31, Sebastian Nagel wrote: > Hi Shi Wei, > > sorry, but it looks like the Selenium protocol plugin has never been > used with a proxy over https. There are two points which need (at a >

Re: encrypt password of the index-writer.xml

2021-11-17 Thread Sebastian Nagel
> following in the log4j.properties but it doesn't help. > > log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticIndexWriter=WARN,cmdstdout > log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticUtils=WARN,cmdstdout > > > Best Regards, > Shi Wei > > O

Re: Error When Connecting Elasticsearch with HTTPS Connection

2021-11-16 Thread Sebastian Nagel
Hi Shi Wei, looks like you're the first trying to connect to ES from Nutch over HTTPS. HTTP is used as default scheme and there is no way to configure the Elasticsearch index writer to use HTTPS. Please open a Jira issue. It's a trivial fix. For a quick fix: in the Nutch source package (or git

Re: JEXL unable to handle "if" statements?

2021-11-11 Thread Sebastian Nagel
Hi Max, fyi, the Jira issue is created: https://issues.apache.org/jira/browse/NUTCH-2902 (to make sure that this is not forgotten) Thanks, Sebastian On 10/11/21 18:11, Sebastian Nagel wrote: > Hi Max, > >> I was able to fix this by switching from JexlExpression to JexlScript.

Re: encrypt password of the index-writer.xml

2021-11-11 Thread Sebastian Nagel
Hi Shi Wei, there is a way, although definitely not the recommended one. Sorry, and it took me a little bit to prove it. Do you know about external XML entities or XXE attacks? 1. On top of the index-writers.xml you add an entity declaration: ]> 2. it's used later in the index writer spec:

Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-10-28 Thread Sebastian Nagel
Hi Shi Wei, sorry, but it looks like the Selenium protocol plugin has never been used with a proxy over https. There are two points which need (at a first glance) a rework: 1. the protocol tries to establish a TLS/SSL connection to the proxy if the URL to be crawled is a https:// URL. There might

Re: Encrypt or Mask the password

2021-10-25 Thread Sebastian Nagel
HTTP Authentication Scheme > > Your sincerely, > Shi Wei > > -Original Message- > From: Sebastian Nagel > Sent: Monday, 25 October, 2021 5:31 PM > To: user@nutch.apache.org > Subject: Re: Encrypt or Mask the password > > Hi Shi Wei, > > for the

Re: Encrypt or Mask the password

2021-10-25 Thread Sebastian Nagel
Hi Shi Wei, for the nutch-site.xml it's possible to use Java properties and/or environment variables, see section "Variable expansion" in https://hadoop.apache.org/docs/r3.3.1/api/org/apache/hadoop/conf/Configuration.html In case you're asking about index-writers.xml - variable expansion (likely

Re: [Non-DoD Source] Re: Cant integrate the kerberos enabled solr cloud with nutch (UNCLASSIFIED)

2021-10-22 Thread Sebastian Nagel
SharePoint Team Requirements Analyst (443) 861-8623 APG Bldg 6002 D5101/108 I am currently teleworking and can be reached at CELL - (860) 670 9494 -Original Message----- From: Sebastian Nagel Sent: Friday, October 22, 2021 5:46 AM To: user@nutch.apache.org Subject: [Non-DoD Source] Re: Cant

Re: Cant integrate the kerberos enabled solr cloud with nutch

2021-10-22 Thread Sebastian Nagel
tps://solr.apache.org/guide/8_5/kerberos-authentication-plugin.html#using-solrj-with-a-kerberized-solr Thanks, Sebastian On 10/22/21 12:01 PM, sw.l...@quandatics.com wrote: Hi Sebastian, Here is the index-writers.xml you requested. Thank Your Sincerely, Shi Wei -Original Message- From: Sebastian Na

Re: Cant integrate the kerberos enabled solr cloud with nutch

2021-10-22 Thread Sebastian Nagel
Hi Shi Wei, could you also share the index writer configuration (conf/index-writers.xml)? The default is unauthenticated access to Solr, see the snippet below. The file httpclient-auth.xml is not relevant for the Solr indexer, it's used if a crawled web site requires authentication in order to f

Re: JEXL unable to handle "if" statements?

2021-10-11 Thread Sebastian Nagel
Hi Max, > I was able to fix this by switching from JexlExpression to JexlScript. I > have a small patch that I'm happy to contribute! Yes, that would be great! Please open also a Jira issue so that the problem shows up in the Changelog. Thanks! Best, Sebastian On 10/11/21 6:34 AM, Max Ockner

Re: OkHttp NoClassDefFoundError: okhttp3/Authenticator

2021-07-23 Thread Sebastian Nagel
Hi Markus, the okhttp protocol plugin should work out-of-the-box and we use it in production (currently on Hadoop 3.2.2). I remember that I once had an issue with the Hadoop library having okhttp as a dependency which then caused a conflict. It was solved by adding an exclusion rule to the Hadoop

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-15 Thread Sebastian Nagel
Hi Clark, thanks for summarizing this discussion and sharing the final configuration! Good to know that it's possible to run Nutch on Hadoop using S3A without using HDFS (no namenode/datanodes running). Best, Sebastian

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
> The local file system? Or hdfs:// or even s3:// resp. s3a://? Also important: the value of "mapreduce.job.dir" - it's usually on hdfs:// and I'm not sure whether the plugin loader is able to read from other filesystems. At least, I haven't tried. On 6/15/21 10:

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
Hi Clark, sorry, I should have read your mail to the end - you mentioned that you downgraded Nutch to run with JDK 8. Could you share which filesystem NUTCH_HOME points to? The local file system? Or hdfs:// or even s3:// resp. s3a://? Best, Sebastian On 6/15/21 10:24 AM, Clark Benham wrote:

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
Hi Clark, the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. Looks like there's something wrong fundamentally, not only with the plugins. > I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3 Are you aware that the N

Re: Apache Nutch help request for a school project :)

2021-06-07 Thread Sebastian Nagel
Hi Gorkem, I haven't verified it by trying - but it may be that given your configuration the Solr instance isn't reachable via http://localhost:8983/solr/nutch Inside the Docker network, host names are the same as container names, that is http://solr:8983/solr/nutch might work. Cf. the docker

Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-04 Thread Sebastian Nagel
Hi Lewis, hi Markus, > snappy compression, which is a massive improvement for large data shuffling jobs Yes, I can confirm this. Also: it's worth considering zstd for all data kept for longer. We use it for a 25-billion CrawlDB: it's almost as fast (both compression and decompression) as snapp

Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-04 Thread Sebastian Nagel
Hi Nicholas, thanks for the pointer. > What is the status of that project? It's definitely alive. And looks like it has improved recently, just compare the support for Linux distributions of the last two releases: https://mirror.synyx.de/apache/bigtop/bigtop-1.4.0/repos/ https://mirror.syny

Re: DuplexWeb-Google - GoogleBot Crawler For Duplex / Google Assistant

2021-06-04 Thread Sebastian Nagel
Thanks! Interesting that the DuplexWeb bot ignores the wildcard user agent rules by default. On 6/3/21 11:44 PM, lewis john mcgibbney wrote: Some interesting content for a short read :) https://www.seroundtable.com/duplexweb-google-bot-31522.html?utm_source=search_engine_roundtable&utm_campaig

Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Sebastian Nagel
m/big-data-europe/docker-hadoop [2] https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11

Re: Adding html field to NutchDocument

2021-06-01 Thread Sebastian Nagel
in/crawl file. Although looking at it now it's clear. This makes it easier for me to access the html content within my plugin, thanks again On Fri, May 28, 2021 at 8:36 PM Sebastian Nagel wrote: Hi Kieran, see the command-line options -addBinaryContent index

Re: Adding html field to NutchDocument

2021-05-28 Thread Sebastian Nagel
Hi Kieran, see the command-line options -addBinaryContent (index raw/binary content in field `binaryContent`) and -base64 (use Base64 encoding for binary content) of the Nutch index job [1]. Note that the content may indeed be binary, eg. for PDF documents but also

Re: Crawling same domain URL's

2021-05-11 Thread Sebastian Nagel
Hi Prateek, alternatively, you could modify the URLPartitioner [1], so that during the "generate" step the URLs of a specific host or domain are distributed over more partitions. One partition is the fetch list of one fetcher map task. At Common Crawl we partition by domain and made the numbe

Re: Redirection behavior

2021-05-06 Thread Sebastian Nagel
nutchplugin.http.Http: fetching https://zyfro.com/ <https://zyfro.com/> 2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available/ I am not sure what I am missing. Regards Prateek On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel mailto:wastl.na...@goo

Re: Writing Nutch data in Parquet format

2021-05-05 Thread Sebastian Nagel
Hi Lewis, > 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format? Yes, but not directly - it's a multi-step process. The outcome: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/ This Parquet index is optimized by sorting the row

Re: googled for ever and still can't figure it out

2021-03-15 Thread Sebastian Nagel
Hi Andrew, > if this flag is used *--sitemaps-from-hostdb always* Do the crawled hosts announce the sitemap in their robots.txt? If not, do the sitemap URLs follow the pattern http://example.com/sitemap.xml ? See https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature If this is no

Re: Extract all image and video links from a web page

2021-01-27 Thread Sebastian Nagel
Hi Prateek, are there any URL filters which filter away image links? You can verify this using the URL filter checker: echo "https://example.com/image.jpg" \ | bin/nutch filterchecker -stdin The default rules in conf/regex-urlfilter.txt exclude common image suffixes. Note that there can b
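
To see why a suffix rule drops image links, the following sketch imitates such a rule with plain grep. It does not call Nutch: the pattern is a simplified stand-in, not the actual rule from conf/regex-urlfilter.txt, and the function check_url and its +/- output are made up for illustration.

```shell
#!/bin/sh
# Illustrative pattern only, NOT the exact rule shipped with Nutch.
IMAGE_SUFFIXES='\.(gif|jpe?g|png|bmp|ico)$'

# Hypothetical helper: print "-URL" if the suffix pattern would reject
# the URL, "+URL" if it would pass.
check_url() {
  if printf '%s\n' "$1" | grep -Eiq "$IMAGE_SUFFIXES"; then
    printf '%s%s\n' "-" "$1"
  else
    printf '%s%s\n' "+" "$1"
  fi
}

check_url "https://example.com/image.jpg"
check_url "https://example.com/page.html"
```

If image links are wanted, the corresponding suffixes would have to be removed from the real exclusion rule, and the result verified with the filter checker.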

Re: NUTCH-2353

2020-12-06 Thread Sebastian Nagel
Hi, no, NUTCH-2353 is still open, see https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2353 The implementation caused a regression, so it was reverted. Best, Sebastian On 12/6/20 7:03 AM, Von Kursor wrote: > Hello > > Has this API enhancement been implemented under 1.17 ? > > I wa

Re: Nutch 2.4 with selenium

2020-10-10 Thread Sebastian Nagel
Hi, > Nutch 2.4 with selenium Nutch 2.4 does not include any plugin to use Selenium. In addition, 2.4 is for now the last release on the 2.x branch which is not maintained anymore. You should use 1.x (1.17 is the most recent release). > standalone nutch crawling with selenium. For 1.x there's a

Re: Unable to get search result using Javascript client..

2020-10-01 Thread Sebastian Nagel
Hi, this question is better asked on the Solr user mailing list as Nutch people are not necessarily familiar with Solr on a deep level. Please also share more details - which JavaScript client, the error message, the log messages of the Solr server at this time. This helps to trace the error down

Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-09-08 Thread Sebastian Nagel
from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10 > > From: Sebastian Nagel<mailto:wastl.na...@googlemail.com.INVALID> > Sent: Tuesday, August 11, 2020 4:56 PM > To: user@nutch.apache.org<mailto:user@nutch.apache.org> > Subject: Re: Regarding N

Re: Unable to index on Hadoop 3.2.0 with 1.16

2020-08-13 Thread Sebastian Nagel
-196X > http://www.researcherid.com/rid/F-3388-2013 > > > > > > > > > Sebastian Nagel wrote on Thu, 13 Aug 2020, > 08:53: > >> Hi Joe, >> >>> I eliminated it when I updated the index-writers.xml for the >> solr_indexer_1 >>

Re: Unable to index on Hadoop 3.2.0 with 1.16

2020-08-12 Thread Sebastian Nagel
Hi Joe, > I eliminated it when I updated the index-writers.xml for the solr_indexer_1 > to use only a single URL. Thanks for the hint. I'm able to reproduce the error by adding an overlong URL to Could you open an issue to fix this on https://issues.apache.org/jira/projects/NUTCH ? Tha

Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-08-11 Thread Sebastian Nagel
Hi, Nutch does not include a search component anymore. These steps are obsolete. All you need is to setup your Hadoop cluster, then run $NUTCH_HOME/runtime/deploy/bin/nutch ... (instead of .../runtime/local/bin/nutch ...) Alternatively, you could launch a Nutch tool, eg. Injector the followin

[ANNOUNCE] New Nutch committer and PMC - Shashanka Balakuntala Srinivasa

2020-07-28 Thread Sebastian Nagel
Dear all, it is my pleasure to announce that Shashanka Balakuntala Srinivasa has joined us as a committer and member of the Nutch PMC. Shashanka Balakuntala has worked recently on a longer list of Nutch issues and improvements. Thanks, Shashanka Balakuntala, and congratulations on your new role

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-27 Thread Sebastian Nagel
to fetch Job directly > so see if there are some improvements. > > I have also concluded this discussion here - > https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/. > So if you want to add something here, please feel free to do so. > > Regard

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
time for sure since Fetcher will be directly creating the final >> avro format that I need. So the only question remains is that if I do >> fetcher.parse=true, can I get rid of parse Job as a separate step >> completely. >> >> Regards >> Prateek >> >> O

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
y indexers. In the > avro conversion step, we just convert data into avro schema > and dump to HDFS. Do you think we still need reducers in the fetch phase? > FYI- I tried running with 0 reducers and don't see any impact as > such. > > Appreciate your help. > > Re

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
Hi Prateek, you're right there is no specific reducer used but without a reduce step the segment data isn't (re)partitioned and the data isn't sorted. This was a strong requirement once Nutch was a complete search engine and the "content" subdir of a segment was used as page cache. Getting the con

[ANNOUNCE] Apache Nutch 1.17 Release

2020-07-02 Thread Sebastian Nagel
The Apache Nutch team is pleased to announce the release of Apache Nutch v1.17. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained configuration, relying on Apache Hadoop™ data structures. Source and binary distributions are available for download from the Apach

[RESULT] was [VOTE] Release Apache Nutch 1.17 RC#1

2020-07-01 Thread Sebastian Nagel
Hi Folks, thanks to everyone who was able to review the release candidate! 72 hours have passed, please see below for vote results. [4] +1 Release this package as Apache Nutch 1.17 Markus Jelsma * Furkan Kamaci * Shashanka Balakuntala Srinivasa Sebastian Nagel * [0] -1 Do not

Re: protocol-interactiveselenium Custom Handler

2020-06-25 Thread Sebastian Nagel
Hi Craig, in case, you're building Nutch from the git repo or from the source package the easiest way is to put the file NewCustomHandler.java into src/plugin/protocol-interactiveselenium/src/java/.../handlers/ and run ant runtime to compile and package Nutch including package your custom hand

[VOTE] Release Apache Nutch 1.17 RC#1

2020-06-18 Thread Sebastian Nagel
Hi Folks, A first candidate for the Nutch 1.17 release is available at: https://dist.apache.org/repos/dist/dev/nutch/1.17/ The release candidate is a zip and tar.gz archive of the binary and sources in: https://github.com/apache/nutch/tree/release-1.17 In addition, a staged maven reposito

Preparing to release 1.17

2020-06-16 Thread Sebastian Nagel
Hi, the list of open issues for 1.17 has become short, and I will move some of the remaining issues to 1.18 to clear the way and prepare the first release candidate in the next two days. If there are urgent fixes (including a PR / patch), let me know! Thanks, Sebastian

Re: Nutch 1.17 download available?

2020-06-07 Thread Sebastian Nagel
Hi Jim, Nutch 1.17 should land soon but there are a couple of issues to be fixed before the release. Best, Sebastian On 6/8/20 12:11 AM, Lewis John McGibbney wrote: > Hi Jim, > Response below > > On 2020/06/06 14:23:24, Jim Anderson wrote: >> >> I cannot find a download for Nutch 1.17. Is Nu

Re: [Non-DoD Source] Re: [DISCUSS] Release 1.17 ? (UNCLASSIFIED)

2020-04-23 Thread Sebastian Nagel
> >> >> user Digest 23 Apr 2020 06:27:46 - Issue 3055 >> >> Topics (messages 34517 through 34517) >> >> [DISCUSS] Release 1.17 ? >> 34517 by: Sebastian Nagel >> >> Administrivia: >> >> --

[DISCUSS] Release 1.17 ?

2020-04-22 Thread Sebastian Nagel
Hi all, 30 issues are done now https://issues.apache.org/jira/browse/NUTCH/fixforversion/12346090 including a number of important dependency upgrades: - Hadoop 3.1 (NUTCH-2777) - Elasticsearch 7.3.0 REST client (NUTCH-2739) Thanks to Shashanka Balakuntala Srinivasa for both! Dependency upgrade

Re: finding broken links with nutch 1.14

2020-03-03 Thread Sebastian Nagel
Hi Robert, 404s are recorded in the CrawlDb after the tool "updatedb" is called. Could you share the commands you're running? Please also have a look into the log files (esp. the hadoop.log) - all fetches are logged and also whether fetches have failed. If you cannot find a log message for the br

Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

2020-01-15 Thread Sebastian Nagel
promising. Hope you enjoy the holiday! > > Joe > > -----Original Message----- > From: Sebastian Nagel > Sent: Thursday, January 2, 2020 7:42 AM > To: user@nutch.apache.org > Subject: Re: Extracting XMP metadata from PDF for indexing Nutch 1.15 > > Hi Joseph, >

Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

2020-01-02 Thread Sebastian Nagel
Hi Joseph, this could be related to https://issues.apache.org/jira/browse/NUTCH-2525 caused by not-all-lowercase meta keys. I'm happy to check whether the attached patch fixes your problem when I'm back from holidays in a few days. Best, Sebastian On 12/31/19 5:43 PM, Gilvary, Joseph wrote:

Re: Fwd: Crawling 3 websites from one nutch

2019-12-27 Thread Sebastian Nagel
Hi, the test compares names of the "host" and the registered domain: doc.getFieldValue('host')=='urgenthomework.com' The host name is "www.urgenthomework.com". You can test it via: $> bin/nutch indexchecker https://www.urgenthomework.com/ fetching: https://www.urgenthomework.com/ ... h
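The mismatch described above can be reproduced outside of Nutch with a small Java sketch. It uses only java.net.URI from the standard library; the class and method names are hypothetical and this is not how Nutch evaluates the expression, it merely shows that the host of the URL includes the "www." prefix and therefore does not equal the registered domain.

```java
import java.net.URI;

// Hypothetical illustration only; not part of Nutch.
class HostVsDomainSketch {

    // Extract the host name from a URL string.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    // Exact string comparison, as in the expression quoted in the message.
    static boolean hostEquals(String url, String expected) {
        return expected.equals(hostOf(url));
    }

    public static void main(String[] args) {
        String url = "https://www.urgenthomework.com/";
        System.out.println(hostOf(url));                               // www.urgenthomework.com
        System.out.println(hostEquals(url, "urgenthomework.com"));     // false
        System.out.println(hostEquals(url, "www.urgenthomework.com")); // true
    }
}
```

So a condition on the "host" field would have to name the full host, including "www.", to match this page.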

Re: Fetch failed with protocol status: gone(11)

2019-12-17 Thread Sebastian Nagel
avalonpontoons.com/ > robots.txt whitelist not configured. > Fetch failed with protocol status: gone(11), lastModified=0: > https://www.avalonpontoons.com/ > > > On Tue, Dec 17, 2019 at 11:53 AM Sebastian Nagel > wrote: > >> Hi Bob, >> >> the relevant Javadoc commen

Re: Fetch failed with protocol status: gone(11)

2019-12-17 Thread Sebastian Nagel
Hi Bob, the relevant Javadoc comment precedes the declaration of a variable (here a constant): /** Resource is gone. */ public static final int GONE = 11; In more detail, GONE results from one of the following HTTP status codes: 400 Bad request 401 Unauthorized 410 Gone (*forever* g
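As a rough illustration of the mapping described in this message, the following Java sketch maps HTTP status codes to the GONE constant. It is an assumption-laden sketch, not the Nutch implementation: the class and method names are hypothetical, and since the list of status codes is cut off in the archived snippet, only the three codes visible above (400, 401, 410) are included.

```java
// Hypothetical illustration only; not the actual Nutch protocol code.
class GoneStatusSketch {

    /** Resource is gone (constant value as quoted in the message). */
    public static final int GONE = 11;

    // Covers only the HTTP codes visible in the truncated message;
    // the original list may contain further codes.
    static boolean mapsToGone(int httpCode) {
        switch (httpCode) {
            case 400: // Bad request
            case 401: // Unauthorized
            case 410: // Gone
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        int[] codes = {200, 400, 401, 404, 410};
        for (int code : codes) {
            System.out.println(code + " -> " + (mapsToGone(code) ? "gone(" + GONE + ")" : "not covered by this sketch"));
        }
    }
}
```

A code reported as "not covered by this sketch" says nothing about how Nutch treats it; it only means the code does not appear in the visible part of the list.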

Re: Map reducer filtering too many sites during generation in Nutch 2.4

2019-11-14 Thread Sebastian Nagel
Hi Makkara, > but I believe that this is the fault of the reducer > Map input records=22048 > Map output records=4 The items are skipped in the mapper. > Is this a known problem of Nutch 2.4, or have I just misconfigured > something? Could be the configuration or
