Hi James,
thanks for the update! Would you mind sharing your solution?
Just thinking about the next user searching for the same problem...
Otherwise: I never indexed the GeoIP domain. And yes, the index-geoip
plugin isn't easy to configure, see
https://nutch.apache.org/documentation/javadoc/
Hi Markus,
>> And I do not agree with it. Almost all content is compressed now, so this
>> will never work. We need the headers and response code stored for WARC
>> export and do not care about an incorrect length header.
No, don't do this. You need to rewrite the header. There are many WARC rea
Hi Sheham,
the nutch-site.xml configures
<property>
  <name>mapreduce.task.timeout</name>
  <value>1800</value>
</property>
1.8 seconds (1800 milliseconds) is very short. The default is 600 seconds or 10
minutes, see [1]. Since Nutch needs to finish fetching before the task timeout
applies, threads not fetching quickly enough and st
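For reference, a corrected nutch-site.xml entry would look like the sketch below (600000 ms, i.e. the 10-minute Hadoop default; pick a larger value if fetch cycles are long):

```xml
<!-- mapreduce.task.timeout is given in milliseconds;
     600000 ms = 10 minutes is the Hadoop default -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
</property>
```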
, see
https://github.com/sebastian-nagel/nutch-test-single-node-cluster/
One note about the CHANGES.md: it's now a mixture of HTML and plain text.
It does not exploit the potential of Markdown, e.g. sections/headlines for
the releases to make the change log navigable via a table of contents.
Th
Hi Tim,
>> I'm using the okhttp protocol, because I don't think the http protocol
>> stores truncation information.
However, protocol-http could mark truncations as well. Please also open an
issue for this and the other protocol plugins.
>> Should I open a ticket to have ParseSegment also check
Hi Michael,
> I wonder if there is not already a build-in option to exclude HTML
> elements (like a div with a given id or class or other elements like header).
No, there isn't one so far.
> I know https://issues.apache.org/jira/browse/NUTCH-585
> I also do not understand why this little patc
Hi,
yes, this is possible by pointing the environment variable
NUTCH_LOG_DIR to a different folder.
The default is: $NUTCH_HOME/logs/
See also the script bin/nutch which is called by bin/crawl:
https://github.com/apache/nutch/blob/master/src/bin/nutch#L30
(it's also possible to change the log f
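As a minimal sketch of the environment variable approach (the target folder is just an example path):

```shell
# redirect Nutch's log output to a custom folder
# (default: $NUTCH_HOME/logs/); /var/log/nutch is an arbitrary example
export NUTCH_LOG_DIR=/var/log/nutch
echo "Nutch will log to: $NUTCH_LOG_DIR"
# prints: Nutch will log to: /var/log/nutch
```

The variable must be exported before bin/crawl or bin/nutch is started, since bin/nutch reads it when it sets up logging.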
On Wed, Jul 26, 2023 at 10:36 AM Sebastian Nagel
wrote:
Hi Steve,
>
file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
what does the file contain? An .eml file (following RFC822)?
Would it be possible to share this file or at least a chunk large
enough to reproduce the issue?
The error message
Dear all,
It is my pleasure to announce that Tim Allison has joined us
as a committer and member of the Nutch PMC.
You may already know Tim as a maintainer of and contributor to
Apache Tika. So, it was great to see contributions to the
Nutch source code from an experienced developer who is also
Hi Eric,
unfortunately, on Windows you also need to download and install winutils.exe and
hadoop.dll,
see
https://github.com/cdarlint/winutils and
https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io
The installation of Ha
Hi Kamil,
> I was wondering if this script is advisable to use?
I haven't tried the script itself but some of the underlying commands
- mergedb, etc.
> merge command ($nutch_dir/nutch merge $index_dir $new_indexes)
Of course, some of the commands are obsolete. A long time ago, Nutch
used Lucene
Hi,
please send a mail to
user-unsubscr...@nutch.apache.org
See
https://nutch.apache.org/community/mailing-lists/
Thanks!
Best,
Sebastian
On 1/25/23 14:53, Steven Zhu wrote:
Please unsubscribe me from the users list.
Steven
On Tue, Jan 24, 2023 at 10:27 PM Ankit gupta
wrote:
Hell
owse/NUTCH-2974
Just in case you want to try it.
~Sebastian
On 11/21/22 10:36, Sebastian Nagel wrote:
Hi Kamil,
thanks for trying and finding a solution! I've opened a JIRA issue to track the
problem: https://issues.apache.org/jira/browse/NUTCH-2974
Thanks!
Sebastian
On 11/19/22 18:37
Hadoop cluster. All commands are the same as in fully
distributed mode.
If it helps, I prepared some setup scripts to run Nutch in pseudo-distributed
mode:
https://github.com/sebastian-nagel/nutch-test-single-node-cluster
Best,
Sebastian
On 1/15/23 04:26, Mike wrote:
I will now try to confi
Hi Mike,
> It can be tedious to set up for the first time, and there are many components.
In case you prefer Linux packages, I can recommend Apache Bigtop, see
https://bigtop.apache.org/
and for the list of package repositories
https://downloads.apache.org/bigtop/stable/repos/
~Sebastian
Hi Paul,
> the indexer was writing the
> documents info in the file (nutch.csv) twice,
Yes, I see. And now I know what I've overlooked:
.../bin/nutch index -Dmapreduce.job.reduces=2
You need to run the CSV indexer with only a single reducer.
In order to do so, please pass the option
--num-tas
Hi Paul,
as far as I can see the indexer is run only once and now indexes 26 documents:
org.apache.nutch.indexer.IndexingJob 2022-11-22 06:32:57,164 INFO
o.a.n.i.IndexingJob [main] Indexer: 26 indexed (add/update)
The logs also indicate that both segments are indexed at once:
org.apache.nu
Hi Kamil,
thanks for trying and finding a solution! I've opened a JIRA issue to track the
problem: https://issues.apache.org/jira/browse/NUTCH-2974
Thanks!
Sebastian
On 11/19/22 18:37, Kamil Mroczek wrote:
I've been able to work around this issue by adding "pattern" to touch tag
on line 101 i
Hi everybody,
because of a growing number of spam account creations, public sign-ups to the
Apache JIRA have been disabled.
In order to allow users to report bugs, we have two options:
1 either users let us know about the issue on the mailing list and one of the
Nutch PMC creates a user account
Hi Paul,
yes, the CSV indexer removes the CSV output before it starts a new one.
The problem here is that the indexer is run twice in a loop.
Possible work-arounds - assumed you're using the script bin/crawl:
1 after each indexing command in the loop, move the CSV output so that
it does not get d
Hi Mike, hi Markus,
there's also
https://issues.apache.org/jira/browse/NUTCH-1806
which would make it much easier to keep up-to-date with the public suffix list.
Or rather: because crawler-commons loads the public suffix list
(for historic reasons named "effective_tld_names.dat") from the class pa
The Apache Nutch team is pleased to announce the release of
Apache Nutch v1.19.
Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration, relying on Apache Hadoop™ data structures.
Source and binary distributions are available for download from the
Apach
Hi Folks,
thanks to everyone who was able to review the release candidate!
72 hours have definitely passed, please see below for vote results.
[4] +1 Release this package as Apache Nutch 1.19
Markus Jelsma *
BlackIce *
Jorge Betancourt *
Sebastian Nagel *
[0] -1 Do not release this
>
> Thanks
> Mike
>
> On Fri, 2 Sept 2022 at 13:25, Sebastian Nagel wrote:
>
>> Hi Mike,
>>
>> the Nutch/Solr schema.xml will be updated with the release of 1.19
>> (expected
>> soon, a vote about RC#1 is ongoing):
>> [NUTCH-
es
> file in the cache.
>
> Since Ralf can compile it without problems, it seems to be an issue on my
> machine only. So Nutch seems fine, therefore +1.
>
> Regards,
> Markus
>
> [1]
> https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
>
&
Hi Mike,
the Nutch/Solr schema.xml will be updated with the release of 1.19 (expected
soon, a vote about RC#1 is ongoing):
[NUTCH-2955] - replace deprecated/removed field type solr.LatLonType
[NUTCH-2957] - add fall-back field definitions for unknown index fields
[NUTCH-2956] - typos in field n
pl/StaticLoggerBinder.class]
>>>>
>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>>> explanation.
>>>> SLF4J: Actual binding is of type
>>>> [org.apache.logging.slf4j.Log4jLoggerFactory]
>>>>
nitialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
>
> I am worried about the indexer-elastic plugin, maybe others have that
> problem too? Otherwise everything seems fine.
>
> Markus
>
> Op ma
://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
Jelsma wrote:
> Sounds good!
>
> I see we're still at Tika 2.3.0, i'll submit a patch to upgrade to the
> current 2.4.1.
>
> Thanks!
> Markus
>
> On Tue, 9 Aug 2022 at 09:11, Sebastian Nagel wrote:
>
>> Hi all,
>>
>> more than 60 issues
Hi all,
more than 60 issues are done for Nutch 1.19
https://issues.apache.org/jira/projects/NUTCH/versions/12349580
including
- important dependency upgrades
- Hadoop 3.3.3
- Any23 2.7
- Tika 2.3.0
- plugin-specific URL stream handlers (NUTCH-2429)
- migration
- from Java/JDK 8
Fyi, the issue is tracked on
https://issues.apache.org/jira/browse/NUTCH-2955
~Sebastian
On 7/14/22 12:54, Sebastian Nagel wrote:
> Hi Mike,
>
> if you do not use the plugin index-geoip, you could simply delete the line
>
> subFieldSuffix="_coordinate&
Hi Rastko,
the description isn't really correct now as NUTCH_HOME is supposed to point to
the runtime
- if the binary package is used: this is the base folder of the package,
eg. apache-nutch-1.18/
- if Nutch is built from the source, you usually point NUTCH_HOME to
runtime/local/ - the dire
Hi Bob,
could you share which instructions you followed and when the error happens:
during import, project build, or running/debugging?
The usual way is
1. to write the Eclipse project configuration, run
ant eclipse
2. import the written project configuration into Eclipse
Building or running/debugging N
Hi Mike,
if you do not use the plugin index-geoip, you could simply delete the line
Otherwise, after the deprecation and the removal of the LatLonType class [1],
it should be:
But I haven't verified whether indexing with index-geoip enabled and the
retrieval works.
In any case, please
Hi Michael,
Nutch (1.18, and trunk/master) should work together with more recent Hadoop
versions.
At Common Crawl we use a modified Nutch version based on the recent trunk
running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop cluster
with x64 and arm64 AWS EC2 instances.
But I
Hi Michael,
the only differences in the protocol-httpclient plugin between Nutch 1.11 and
1.13 are
- NUTCH-2280 [1] which allows configuring the cookie policy
- NUTCH-2355 [2] which allows setting an explicit cookie for a request URL
Could this be related?
Are there any useful hints what could b
indexed data to MongoDB for further
> processing.
>
> Kind regards,
> Roseline
>
>
>
>
>
> -Original Message-
> From: Sebastian Nagel
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
>
nonnegative (>=0), content longer
> than it will be truncated; otherwise, no truncation at all. Do not
> confuse this setting with the file.content.limit setting.
>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byHost</value>
> </property>
>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
Hi Ayhan,
you mean?
https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt
Sebastian
On 12/13/21 20:59, Ayhan Koyun wrote:
> Hi,
>
> as I wrote before, it seems that I am not the only one who can not crawl all
> the seed.txt url's. I could
Hi Roseline,
> 5,36405,0,http://www.notco.com
What is the status for https://notco.com/ which is the final redirect
target?
Is the target page indexed?
~Sebastian
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
>
Hi Shi Wei,
fyi: a fix for NUTCH-2903 is ready
https://github.com/apache/nutch/pull/703
Sebastian
On 11/16/21 13:54, Sebastian Nagel wrote:
> Hi Shi Wei,
>
> looks like you're the first trying to connect to ES from Nutch over
> HTTPS. HTTP is used as default scheme and t
The issue is now tracked in
https://issues.apache.org/jira/browse/NUTCH-2907
On 10/28/21 15:31, Sebastian Nagel wrote:
> Hi Shi Wei,
>
> sorry, but it looks like the Selenium protocol plugin has never been
> used with a proxy over https. There are two points which need (at a
>
> following in the log4j.properties but it doesn't help.
>
> log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticIndexWriter=WARN,cmdstdout
> log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticUtils=WARN,cmdstdout
>
>
> Best Regards,
> Shi Wei
>
> O
Hi Shi Wei,
looks like you're the first trying to connect to ES from Nutch over
HTTPS. HTTP is used as default scheme and there is no way to configure
the Elasticsearch index writer to use HTTPS.
Please open a Jira issue. It's a trivial fix.
For a quick fix: in the Nutch source package (or git
Hi Max,
fyi, the Jira issue is created:
https://issues.apache.org/jira/browse/NUTCH-2902
(to make sure that this is not forgotten)
Thanks,
Sebastian
On 10/11/21 18:11, Sebastian Nagel wrote:
> Hi Max,
>
>> I was able to fix this by switching from JexlExpression to JexlScript.
Hi Shi Wei,
there is a way, although definitely not the recommended one.
Sorry, and it took me a little bit to prove it.
Do you know about external XML entities or XXE attacks?
1. On top of the index-writers.xml you add an entity declaration:
]>
2. it's used later in the index writer spec:
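The two steps could look like the sketch below. This is purely illustrative: the file path, entity name, and parameter name are assumptions, and only the writer class name is taken from the actual index-writers.xml. The XML parser expands the external entity when the file is loaded, which is exactly the XXE mechanism mentioned above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE writers [
  <!-- external entity read from a local file (hypothetical path) -->
  <!ENTITY solrPass SYSTEM "file:///etc/nutch/solr-password.txt">
]>
<writers>
  <writer id="indexer_solr_1"
          class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <!-- the entity reference is replaced by the file content on parse -->
      <param name="password" value="&solrPass;"/>
    </parameters>
  </writer>
</writers>
```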
Hi Shi Wei,
sorry, but it looks like the Selenium protocol plugin has never been
used with a proxy over https. There are two points which need (at a
first glance) a rework:
1. the protocol tries to establish a TLS/SSL connection to the proxy if
the URL to be crawled is a https:// URL. There might
HTTP Authentication Scheme
>
> Your sincerely,
> Shi Wei
>
> -Original Message-
> From: Sebastian Nagel
> Sent: Monday, 25 October, 2021 5:31 PM
> To: user@nutch.apache.org
> Subject: Re: Encrypt or Mask the password
>
> Hi Shi Wei,
>
> for the
Hi Shi Wei,
for the nutch-site.xml it's possible to use Java properties and/or
environment variables,
see section "Variable expansion" in
https://hadoop.apache.org/docs/r3.3.1/api/org/apache/hadoop/conf/Configuration.html
In case you're asking about index-writers.xml - variable expansion (likely
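As a sketch of the nutch-site.xml side (property name and variable name are examples, not shipped defaults): Hadoop's Configuration expands `${some.property}` from Java system properties and `${env.NAME}` from environment variables when the value is read.

```xml
<!-- sketch: keep the secret out of the config file;
     SOLR_PASSWORD is a hypothetical environment variable -->
<property>
  <name>solr.auth.password</name>
  <value>${env.SOLR_PASSWORD}</value>
</property>
```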
-Original Message-----
From: Sebastian Nagel
Sent: Friday, October 22, 2021 5:46 AM
To: user@nutch.apache.org
Subject: [Non-DoD Source] Re: Cant
tps://solr.apache.org/guide/8_5/kerberos-authentication-plugin.html#using-solrj-with-a-kerberized-solr
Thanks,
Sebastian
On 10/22/21 12:01 PM, sw.l...@quandatics.com wrote:
Hi Sebastian,
Here is the index-writers.xml you requested. Thank
Your Sincerely,
Shi Wei
-Original Message-
From: Sebastian Na
Hi Shi Wei,
could you also share the index writer configuration (conf/index-writers.xml)?
The default is unauthenticated access to Solr, see the snippet below.
The file httpclient-auth.xml is not relevant for the Solr indexer, it's
used if a crawled web site requires authentication in order to f
Hi Max,
> I was able to fix this by switching from JexlExpression to JexlScript. I
> have a small patch that I'm happy to contribute!
Yes, that would be great! Please also open a Jira issue so that the
problem shows up in the Changelog.
Thanks!
Best,
Sebastian
On 10/11/21 6:34 AM, Max Ockner
Hi Markus,
the okhttp protocol plugin should work out-of-the-box
and we use it in production (currently on Hadoop 3.2.2)
I remember that I had once an issue with the Hadoop library
having okhttp as a dependency which then caused a conflict.
It was solved by adding an exclusion rule to the Hadoop
Hi Clark,
thanks for summarizing this discussion and sharing the final configuration!
Good to know that it's possible to run Nutch on Hadoop using S3A without
using HDFS (no namenode/datanodes running).
Best,
Sebastian
> The local file system? Or hdfs:// or even s3:// resp. s3a://?
Also important: the value of "mapreduce.job.dir" - it's usually
on hdfs:// and I'm not sure whether the plugin loader is able to
read from other filesystems. At least, I haven't tried.
On 6/15/21 10:
Hi Clark,
sorry, I should have read your mail to the end - you mentioned that
you downgraded Nutch to run with JDK 8.
Could you share to which filesystem does NUTCH_HOME point?
The local file system? Or hdfs:// or even s3:// resp. s3a://?
Best,
Sebastian
On 6/15/21 10:24 AM, Clark Benham wrote:
Hi Clark,
the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. Looks like
something is fundamentally wrong, not only with the plugins.
> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
Are you aware that the N
Hi Gorkem,
I haven't verified it by trying - but it may be that given your configuration
the Solr instance isn't reachable via
http://localhost:8983/solr/nutch
Inside the Docker network, host names are the same as container names, that is
http://solr:8983/solr/nutch
might work. Cf. the docker
Hi Lewis, hi Markus,
> snappy compression, which is a massive improvement for large data shuffling
jobs
Yes, I can confirm this. Also: it's worth considering zstd for all data kept
for longer. We use it for a 25-billion CrawlDb: it's almost as fast (both
compression and decompression) as snapp
Hi Nicholas,
thanks for the pointer.
> What is the status of that project?
It's definitely alive. And it looks like it has improved recently;
just compare the support for Linux distributions of the last two releases:
https://mirror.synyx.de/apache/bigtop/bigtop-1.4.0/repos/
https://mirror.syny
Thanks! Interesting that the duplexweb bot ignores the wildcard user agent
rules by default.
On 6/3/21 11:44 PM, lewis john mcgibbney wrote:
Some interesting content for a short read :)
https://www.seroundtable.com/duplexweb-google-bot-31522.html?utm_source=search_engine_roundtable&utm_campaig
m/big-data-europe/docker-hadoop
[2]
https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
in/crawl file.
Although looking at it now it's clear.
This makes it easier for me to access the html content within my plugin,
thanks again
On Fri, May 28, 2021 at 8:36 PM Sebastian Nagel
wrote:
Hi Kieran,
see the command-line options
-addBinaryContent
index
Hi Kieran,
see the command-line options
-addBinaryContent
index raw/binary content in field `binaryContent`
-base64
use Base64 encoding for binary content
of the Nutch index job [1]. Note that the content may indeed be
binary, e.g. for PDF documents but also
Hi Prateek,
alternatively, you could modify the URLPartitioner [1], so that during the
"generate" step
the URLs of a specific host or domain are distributed over more partitions. One
partition
is the fetch list of one fetcher map task. At Common Crawl we partition by
domain and made
the numbe
nutchplugin.http.Http: fetching
https://zyfro.com/
2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread:
FetcherThread 50 has no more work available
I am not sure what I am missing.
Regards
Prateek
On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel mailto:wastl.na...@goo
Hi Lewis,
> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet
format?
Yes, but not directly - it's a multi-step process. The outcome:
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
This Parquet index is optimized by sorting the row
Hi Andrew,
> if this flag is used *--sitemaps-from-hostdb always*
Do the crawled hosts announce the sitemap in their robots.txt?
If not, do the sitemap URLs follow the pattern
http://example.com/sitemap.xml ?
See https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature
If this is no
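For reference, a sitemap announcement in robots.txt is a single line (URL is an example):

```
Sitemap: http://example.com/sitemap.xml
```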
Hi Prateek,
are there any URL filters which filter away image links?
You can verify this using the URL filter checker:
echo "https://example.com/image.jpg" \
| bin/nutch filterchecker -stdin
The default rules in conf/regex-urlfilter.txt exclude common
image suffixes. Note that there can b
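The exclusion works via a suffix rule; as a rough sketch of the mechanism (the pattern below only approximates the shipped default in conf/regex-urlfilter.txt, which lists more suffixes):

```shell
# simulate the regex-urlfilter suffix rule:
# URLs matching the pattern are rejected by the filter
url="https://example.com/image.jpg"
if printf '%s\n' "$url" | grep -qiE '\.(gif|jpg|jpeg|png|bmp|ico)$'; then
  echo "excluded"   # image suffix matched -> URL is filtered away
else
  echo "kept"
fi
# prints: excluded
```

So if image links should be crawled, the corresponding suffixes have to be removed from that rule.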
Hi,
no, NUTCH-2353 is still open, see
https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2353
The implementation caused a regression, so it was reverted.
Best,
Sebastian
On 12/6/20 7:03 AM, Von Kursor wrote:
> Hello
>
> Has this API enhancement been implemented under 1.17 ?
>
> I wa
Hi,
> Nutch 2.4 with selenium
Nutch 2.4 does not include any plugin to use Selenium. In addition, 2.4 is
for now the last release on the 2.x branch, which is not
maintained anymore. You should use 1.x (1.17 is the
most recent release).
> standalone nutch crawling with selenium.
For 1.x there's a
Hi,
this question is better asked on the Solr user mailing list
as Nutch people are not necessarily familiar with Solr on a deep level.
Please also share more details - which JavaScript client, the error message,
the log messages of the Solr server at this time. This helps to trace the
error down
>
> From: Sebastian Nagel<mailto:wastl.na...@googlemail.com.INVALID>
> Sent: Tuesday, August 11, 2020 4:56 PM
> To: user@nutch.apache.org<mailto:user@nutch.apache.org>
> Subject: Re: Regarding N
-196X
> http://www.researcherid.com/rid/F-3388-2013
>
> Sebastian Nagel wrote on Thu, 13 Aug 2020 at 08:53:
>
>> Hi Joe,
>>
>>> I eliminated it when I updated the index-writers.xml for the
>> solr_indexer_1
>&g
Hi Joe,
> I eliminated it when I updated the index-writers.xml for the solr_indexer_1
> to use only a single URL.
Thanks for the hint. I'm able to reproduce the error by adding an overlong URL
to
Could you open an issue to fix this on
https://issues.apache.org/jira/projects/NUTCH ?
Tha
Hi,
Nutch does not include a search component anymore. These steps are obsolete.
All you need is to set up your Hadoop cluster, then run
$NUTCH_HOME/runtime/deploy/bin/nutch ...
(instead of .../runtime/local/bin/nutch ...)
Alternatively, you could launch a Nutch tool, eg. Injector
the followin
Dear all,
it is my pleasure to announce that Shashanka Balakuntala Srinivasa has joined us
as a committer and member of the Nutch PMC. Shashanka Balakuntala has recently
worked on a long list of Nutch issues and improvements.
Thanks, Shashanka Balakuntala, and congratulations on your new role
to fetch Job directly
> so see if there are some improvements.
>
> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
> So if you want to add something here, please feel free to do so.
>
> Regard
time for sure since Fetcher will be directly creating the final
>> avro format that I need. So the only question remains is that if I do
>> fetcher.parse=true, can I get rid of parse Job as a separate step
>> completely.
>>
>> Regards
>> Prateek
>>
>> O
y indexers. In the
> avro conversion step, we just convert data into avro schema
> and dump to HDFS. Do you think we still need reducers in the fetch phase?
> FYI- I tried running with 0 reducers and don't see any impact as
> such.
>
> Appreciate your help.
>
> Re
Hi Prateek,
you're right, there is no specific reducer used, but without a reduce step
the segment data isn't (re)partitioned and the data isn't sorted.
This was a strong requirement once Nutch was a complete search engine
and the "content" subdir of a segment was used as page cache.
Getting the con
The Apache Nutch team is pleased to announce the release of
Apache Nutch v1.17.
Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration, relying on Apache Hadoop™ data structures.
Source and binary distributions are available for download from the
Apach
Hi Folks,
thanks to everyone who was able to review the release candidate!
72 hours have passed, please see below for vote results.
[4] +1 Release this package as Apache Nutch 1.17
Markus Jelsma *
Furkan Kamaci *
Shashanka Balakuntala Srinivasa
Sebastian Nagel *
[0] -1 Do not
Hi Craig,
in case, you're building Nutch from the git repo or from the source package
the easiest way is to put the file NewCustomHandler.java into
src/plugin/protocol-interactiveselenium/src/java/.../handlers/
and run
ant runtime
to compile and package Nutch including your custom hand
Hi Folks,
A first candidate for the Nutch 1.17 release is available at:
https://dist.apache.org/repos/dist/dev/nutch/1.17/
The release candidate is a zip and tar.gz archive of the binary and sources in:
https://github.com/apache/nutch/tree/release-1.17
In addition, a staged maven reposito
Hi,
the list of open issues for 1.17 became short, and I will move some of the
remaining issues to 1.18 to clear the way and prepare the first release
candidate in the next two days.
If there are urgent fixes (including a PR / patch), let me know!
Thanks,
Sebastian
Hi Jim,
Nutch 1.17 should land soon but there are a couple of issues to be fixed before
the release.
Best,
Sebastian
On 6/8/20 12:11 AM, Lewis John McGibbney wrote:
> Hi Jim,
> Response below
>
> On 2020/06/06 14:23:24, Jim Anderson wrote:
>>
>> I cannot find a download for Nutch 1.17. Is Nu
Hi all,
30 issues are done now
https://issues.apache.org/jira/browse/NUTCH/fixforversion/12346090
including a number of important dependency upgrades:
- Hadoop 3.1 (NUTCH-2777)
- Elasticsearch 7.3.0 REST client (NUTCH-2739)
Thanks to Shashanka Balakuntala Srinivasa for both!
Dependency upgrade
Hi Robert,
404s are recorded in the CrawlDb after the tool "updatedb" is called.
Could you share the commands you're running? Please also have a look into the
log files (esp. the
hadoop.log) - all fetches are logged and
also whether fetches have failed. If you cannot find a log message
for the br
promising. Hope you enjoy the holiday!
>
> Joe
>
> -Original Message-----
> From: Sebastian Nagel
> Sent: Thursday, January 2, 2020 7:42 AM
> To: user@nutch.apache.org
> Subject: Re: Extracting XMP metadata from PDF for indexing Nutch 1.15
>
> Hi Joseph,
>
Hi Joseph,
this could be related to
https://issues.apache.org/jira/browse/NUTCH-2525
caused by not-all-lowercase meta keys.
I'm happy to check whether the attached patch fixes your problem
when I'm back from holidays in a few days.
Best,
Sebastian
On 12/31/19 5:43 PM, Gilvary, Joseph wrote:
Hi,
the test compares names of the "host" and the registered domain:
doc.getFieldValue('host')=='urgenthomework.com'
The host name is "www.urgenthomework.com". You can test it via:
$> bin/nutch indexchecker https://www.urgenthomework.com/
fetching: https://www.urgenthomework.com/
...
h
avalonpontoons.com/
> robots.txt whitelist not configured.
> Fetch failed with protocol status: gone(11), lastModified=0:
> https://www.avalonpontoons.com/
>
>
> On Tue, Dec 17, 2019 at 11:53 AM Sebastian Nagel
> wrote:
>
>> Hi Bob,
>>
>> the relevant Javadoc commen
Hi Bob,
the relevant Javadoc comment stands before the declaration of a variable (here
a constant):
/** Resource is gone. */
public static final int GONE = 11;
In more detail, GONE results from one of the following HTTP status codes:
400 Bad request
401 Unauthorized
410 Gone (*forever* g
Hi Makkara,
> but I believe that this is the fault of the reducer
> Map input records=22048
> Map output records=4
The items are skipped in the mapper.
> Is this a known problem of Nutch 2.4, or have I just misconfigured
> something?
Could be the configuration or