This is an automated email from the ASF dual-hosted git repository.
snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.
from 961c725 NUTCH-2034 CrawlDB update job to count documents in CrawlDb
             rejected by URL filters (patch contributed by Luis Lopez)
 add 62f6d9f Add a new IndexingFilter that uses JEXL to decide whether to
             index a document.
 add 36bfac1 Some improvements based on revewier's feedback.
 add d72591a Better tests.
 add c7c795a Merge branch 'master' of https://github.com/apache/nutch into
             index-jexl-filter
 add a985e30 Fixed per reviewers' comments. Changed the package name to be
             more specific, added package-info.java, added to more build targets.
 add bea8621 doclint does not like self-closing tags.
 new 34236ff fix for NUTCH-2370 contributed by [email protected]
 new d758a31 NUTCH-2474 CrawlDbReader -stats fails with ClassCastException
             - replace CrawlDbStatCombiner by CrawlDbStatReducer and ensure that
               data is properly processed independently whether and how often
               combiner is called
             - simplify calculation of minimum and maximum
 new 26669eb - filter out NaN scores which break the quantile calculation
 new 194fc37 Extend indexer-elastic-rest to support languages
 new 153525c fix formatting
 new 5ccebc9 add languages to default config
 new 9fcc2a4 fix delete
 new 42bdc65 NUTCH-2439 Upgrade Apache Tika dependency to 1.17
 new 2be2052 Add tika-config.xml to suppress Tika warnings on stderr
 new e0326de make fully configurable
 new e7b077e NUTCH-2480 Upgrade crawler-commons dependency to 0.9
 new 52a1c50 fix indentation
 new 67dc52c scope variables
 new 416c457 NUTCH-2354 Upgrade Hadoop dependencies to 2.7.4
 new e7d5c13 NUTCH-2362 Upgrade MaxMind GeoIP version in index-geoip
 new e0e06f5 NUTCH-2035 urlfilter-regex case insensitive rules
 new 35193c2 NUTCH-2478 HTML parser should resolve base URL <base href=...>
             - fix parse-html and parse-tika
             - add unit test for parse-html
 new 8f692d1 NUTCH-2478 HTML parser should resolve base URL <base href=...>
             - finally fix parse-tika:
               - href attribute of base element dropped in DOM
               - need to call tikamd.get("Content-Location")
             - port HTML parser test from parse-html to parse-tika
             - add method to DomUtil which prints DocumentFragment
 new 4da6b19 fix for NUTCH-2477 (refactor checker classes) contributed by
             Jurian Broertjes
 new 9fb5777 Improve command-line help for URL filter and normalizer
             checker
 new 22fc7f0 NUTCH-2322 URL not available for Jexl operations - apply
             patch contributed by Markus Jelsma
 new e0a27c7 NUTCH-2034 CrawlDB update job to count documents in CrawlDb
             rejected by URL filters (patch contributed by Luis Lopez)
 new fc89e4f NUTCH-2415 Create a JEXL based IndexingFilter Merge branch
             'pipldev-index-jexl-filter', closes #219
The 23 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
 build.xml                                          |   4 +
 conf/nutch-default.xml                             |  18 +++
 default.properties                                 |   1 +
 src/plugin/build.xml                               |   2 +
 .../{headings => index-jexl-filter}/build.xml      |   6 +-
 .../ivy.xml                                        |   0
 .../plugin.xml                                     |  14 +--
 .../nutch/indexer/jexl/JexlIndexingFilter.java     | 131 +++++++++++++++++++++
 .../apache/nutch/indexer/jexl}/package-info.java   |  16 ++-
 .../nutch/indexer/jexl/TestJexlIndexingFilter.java | 124 +++++++++++++++++++
 10 files changed, 301 insertions(+), 15 deletions(-)
 copy src/plugin/{headings => index-jexl-filter}/build.xml (88%)
 copy src/plugin/{urlnormalizer-slash => index-jexl-filter}/ivy.xml (100%)
 copy src/plugin/{mimetype-filter => index-jexl-filter}/plugin.xml (74%)
 create mode 100644 src/plugin/index-jexl-filter/src/java/org/apache/nutch/indexer/jexl/JexlIndexingFilter.java
 copy src/plugin/{scoring-similarity/src/java/org/apache/nutch/scoring/similarity/util => index-jexl-filter/src/java/org/apache/nutch/indexer/jexl}/package-info.java (51%)
 create mode 100644 src/plugin/index-jexl-filter/src/test/org/apache/nutch/indexer/jexl/TestJexlIndexingFilter.java