This is an automated email from the ASF dual-hosted git repository.
jnioche pushed a change to branch 851
in repository https://gitbox.apache.org/repos/asf/incubator-stormcrawler.git
from 27414e98 Initial code for using Docker image for testing SOLR - WIP
add 56fa9b7b nextFetchDate field in SOLR schema should be optional, fixes
#1051
add c4a3e090 OpenSearch 2.7.0 + renamed OpenSearchConnection (#1064)
add 718892e3 (Re)separate injection from crawl topologies in *Search
archetypes, fixes #1065
add 8481aefa Remove injection from crawl topologies in *Search archetypes,
fixes #1065
add 0ddaaba1 BasicURLNormalizer .unmangleQueryString() returns invalid
results if "&" symbol in a parents path #1059 (#1062)
add 63f836a1 Removed remaining references to ES in OpenSearch module
add 3876b539 Dependency upgrades.fixes #1066 (#1067)
add acddeb2b Automatic creation of index definitions should use the bolt
type (#1069)
add 828bd42f Maven plugin upgrades + better handling of plugin versions
add 2db5017d bugfix test jar not attached
add a07f32d3 Update maven.yml
add 9589e1b2 mechanism to retrieve more generic value of configuration
(#1071)
add 52ca5b23 Merge branch 'master' of
github.com:DigitalPebble/storm-crawler
add 7d972684 Batch requests in DeleterBolt, fixes #1072
add 1bc37a78 Update README.md
add 9920e6b1 Create DeletionBolt.java for Solr. #1050 (#1073)
add d1d2d590 SOLR: suppress warnings + minor changes and Javadoc + added
deletion to default topology
add 6a15da1d Tika 2.8.0, fixes 1066
add edba0d04 Increase the number of redirects to 5 for Robots.txt fetching
(#1074)
add f2b30cf4 Add test coverage reports with JaCoCo and Coveralls, fixes
#1075
add 92029b6f #1075 - Add test coverage reports with JaCoCo
add bfbfddae #1075 - Update GH workflow to reduce log spam by adding -B
and --no-transfer-progress maven options
add fc36a105 Issue #1042: Adapt parsing of robots.txt files (#1055)
add 91ae9778 Applied formatting
add 487f1e30 Upgrades to XSoup 0.3.7, fixes #1082
add d8188746 Test URL Filtering from the command line (#1081)
add f2d29fdb CC 1.4, fix #1085
add c6e5aa80 Minor - uppercase static field name to follow conventions
add 90e52e33 Upgrade to Storm 2.5.0, fix #1089
add 24803236 Tika 2.9.0, fixes #1090
add b4bfebdc Pre-release 2.9
add 0b282bbd [maven-release-plugin] prepare release 2.9
add f7dfa823 [maven-release-plugin] prepare for next development iteration
add 156f817c Selenium test (#1093)
add 15711ad2 Dependency upgrades, fix #1094. moved management of version for
testcontainers to top level + various mvn plugins upgrades
add a6455581 upgraded plugin dependencies in archetypes; fix #1094
add 848166dd SQL StatusUpdaterBolt bugs, fix #1095
add 8c7eac63 Add static utility class to URLPartitioner
add bc21ebfa Trivial change to README in OpenSearch archetype
add c630e614 Protocol util - add option to dump the content to a tmp file
add 33696686 Activate sitemap discovery in archetypes; fix #1096
add c7a5578a Make all protocol implementations testable on the command
line, fix #1097
add 51564508 Add OR operator for filter logic in DelegatorProtocol (and
custom flag for robots) fix #1098
add 7670cf1e Remove deprecated class DelegatorRemoteDriverProtocol,fix
#1099
add 114fd9e9 Turn off tracing in Selenium driver, fix #1100
add 87b0eb19 refactoring timeouts Selenium (#1102)
add ee01cbd3 Bug fix post 1102
add d6f13776 Improvements and fixes to HttpRobotRulesParser when following
redirects (#1103)
add 18aae321 User agent substitution not handled correctly, fix #1109
add 0623bdea Removed unused conf, fix #1099
add e233c854 DelegatorProtocol to filter with regexps on URLs, fix #1110
add c313b4fe Fetcher, set number of threads via metadata, fix #1111.
Clarify variable for custom minCrawlDelay
add 54620065 Fetcher: pass custom delay for queues via metadata, fix #1112
add 2bd817e3 Pre 2.10 release
add 58c29554 [maven-release-plugin] prepare release 2.10
add 8406ce7b [maven-release-plugin] prepare for next development iteration
add 7f8f8292 Fix README
add 7092b62b Maven plugin updates
add e9d0edee Applied formatting with new version of the plugin
add cfe61d7c OS 211 (#1114)
add ef31e509 Improve Selenium tests,fix #1115
add 15562121 Use mock server for selenium tests, fix #1116 (#1119)
add adb44fb4 pom cleanups; jwarc & wiremock dependency upgrades
add f526e47f Dependency upgrades,fix #1118
add 5e8802f6 Selenium tests: moved Jetty handling to abstract class so
that it can be reused from other implementations
add 4d3340fc Issue #728: Adding asterisk for metadata transfer (#1117)
add 2eaa33dd Added missing license header to MetadataTest
add 76a70ba8 AbstractIndexerBolt - avoid reparsing metadata keys for each
document, fixes #1124
add 0a8afbf5 WARCSpout loads inputs using HDFS (#1122)
add 857bf09d Merge branch 'master' of
github.com:DigitalPebble/storm-crawler
add 3bd2d7ac Fix wrong most recent date was set (#1126)
add ad706a4a Upgrade to Apache Storm 2.6.0, fix #1127
add 00f319b0 FileSpout: spread the work based on the number of
instances,fix #1125
add ac4408c2 Add configurable delay between launching Fetch threads, fix
#1128
add 71cae464 SQL MetricsConsumer use Timestamps instead of dates
add 87145c3a Add debug to protocolfactory to see which instance of a
protocol got a URL
add 642cf5fb Glob field mapping for indexer.md.mapping (#1130)
add 012dace8 Archetypes to prompt user for user agent values,fix #1131
add 5f83770b Remove default values for user agent,fix #1129
add 5740f42e Utilize new SimpleRobotRulesParser API entry point,fix #1086
add c3ae8c7d Fix flaky test in AdaptiveSchedulerTest.testSchedule,fix #1076
add 6869b5ac Use versioned image for standalone-chrome in Selenium tests
add babf4a72 archetypes: fix variable rewrite + httpagentversion won't
have a default anymore; fixes #1131
add dfe6d236 OpenSearch dashboard script work from anywhere, fix #1132
add 7f70a47e Add committer statement (#1134)
add 31a4b2ab Implement configurable getDocumentID in DeletionBolt (#1135)
add d67ba6bc import Kibana script work from anywhere, fix #1136
add 1ee61a44 Add two tests for SiteMapParserBolt (#1138)
add b6ea3639 dependency upgrades (#1139)
add 66086162 JSoup 1.17.2
add 5392fc93 Release 2.11
add ccd318c6 [maven-release-plugin] prepare release 2.11
add 94f8bd2c [maven-release-plugin] prepare for next development iteration
add 93747cd7 Handling of DateTimeParseException in WARCSpout (#1140)
add a8d7419b Improve metrics for StatusUpdaterBolts,fix #1141
add 41dd9100 Add sniffing for OpenSearch, fix #1142
add 069f6850 Configure proxy with a single conf element + improve handling
of blank values in SCProxy; improvement to CharsetIdentification
add eb69c1e6 Create CODE_OF_CONDUCT.md
add 701ef3c3 Update README.md
add b1e0caa3 Merge branch 'master' of
github.com:DigitalPebble/storm-crawler
add 32eab34b Generate THIRD-PARTY.txt file, fixes #1145 (#1146)
add 4b9a8a63 OpenSearch tests to use explicitly versioned Docker image,
fixes #1147
add 16da6526 bugfix - had forgotten to add the new file
add 953a8c5f Remove coveralls maven plugin, fixes #1148 (#1149)
add 2f80e7ab Dependency upgrades, fix #1144
add 15d29d26 Removed dead link to screenshot of Kibana dash in ES module
add fcc3b979 OpenSearch 2.12.0, fixes #1150
add 7109685e Force version of commons-io to 2.11.0, fixes #1151
add d404022a Partial revert of #1144 to keep Jackson in sync with Apache
Storm
add 56a646f2 Update third-party
add 04e711db Add properties for missing third party libraries
add 1b5c0384 OpenSearch - better handling of mappings (#1155)
add 8dee25ca Delete CODE_OF_CONDUCT.md (#1158)
add 5a51efb3 Create DISCLAIMER (#1159)
add bc8de236 Update NOTICE (#1160)
add bdc34cbc Changed package names to org.apache + fixed references to
DigitalPebble where possible (#1165)
new d2ef5a1f Merge branch 'main' into 851
new 08e8e76a Merge from main
The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
.github/workflows/code_coverage.yml | 29 ++
.github/workflows/maven.yml | 6 +-
DISCLAIMER | 10 +
NOTICE | 4 +-
README.md | 36 +-
THIRD-PARTY.properties | 4 +
THIRD-PARTY.txt | 547 +++++++++++++++++++++
archetype/pom.xml | 6 +-
.../META-INF/maven/archetype-metadata.xml | 71 +--
.../main/resources/archetype-resources/README.md | 2 +-
.../archetype-resources/crawler-conf.yaml | 58 ++-
.../resources/archetype-resources/crawler.flux | 20 +-
.../src/main/resources/archetype-resources/pom.xml | 24 +-
.../src/main/java/CrawlTopology.java | 24 +-
.../src/main/resources/jsoupfilters.json | 6 +-
.../src/main/resources/parsefilters.json | 8 +-
.../src/main/resources/urlfilters.json | 18 +-
core/pom.xml | 54 +-
.../stormcrawler/protocol/Protocol.java | 40 --
.../selenium/DelegatorRemoteDriverProtocol.java | 81 ---
.../protocol/selenium/RemoteDriverProtocol.java | 87 ----
.../apache}/stormcrawler/ConfigurableTopology.java | 6 +-
.../apache}/stormcrawler/Constants.java | 2 +-
.../apache}/stormcrawler/JSONResource.java | 2 +-
.../apache}/stormcrawler/Metadata.java | 20 +-
.../apache}/stormcrawler/bolt/FeedParserBolt.java | 28 +-
.../apache}/stormcrawler/bolt/FetcherBolt.java | 108 ++--
.../apache}/stormcrawler/bolt/JSoupParserBolt.java | 48 +-
.../stormcrawler/bolt/SimpleFetcherBolt.java | 45 +-
.../stormcrawler/bolt/SiteMapParserBolt.java | 28 +-
.../stormcrawler/bolt/StatusEmitterBolt.java | 26 +-
.../apache}/stormcrawler/bolt/URLFilterBolt.java | 12 +-
.../stormcrawler/bolt/URLPartitionerBolt.java | 8 +-
.../apache}/stormcrawler/filtering/URLFilter.java | 6 +-
.../apache}/stormcrawler/filtering/URLFilters.java | 84 +++-
.../filtering/basic/BasicURLFilter.java | 6 +-
.../filtering/basic/BasicURLNormalizer.java | 23 +-
.../filtering/basic/SelfURLFilter.java | 6 +-
.../filtering/depth/MaxDepthFilter.java | 8 +-
.../stormcrawler/filtering/host/HostURLFilter.java | 6 +-
.../filtering/metadata/MetadataFilter.java | 6 +-
.../filtering/regex/FastURLFilter.java | 10 +-
.../stormcrawler/filtering/regex/RegexRule.java | 2 +-
.../filtering/regex/RegexURLFilter.java | 2 +-
.../filtering/regex/RegexURLFilterBase.java | 6 +-
.../filtering/regex/RegexURLNormalizer.java | 6 +-
.../filtering/robots/RobotsFilter.java | 14 +-
.../filtering/sitemap/SitemapFilter.java | 10 +-
.../stormcrawler/indexing/AbstractIndexerBolt.java | 143 ++++--
.../stormcrawler/indexing/DummyIndexer.java | 8 +-
.../stormcrawler/indexing/StdOutIndexer.java | 8 +-
.../stormcrawler/jsoup/LDJsonParseFilter.java | 12 +-
.../stormcrawler/jsoup/LinkParseFilter.java | 20 +-
.../apache}/stormcrawler/jsoup/XPathFilter.java | 12 +-
.../parse/DocumentFragmentBuilder.java | 2 +-
.../apache}/stormcrawler/parse/JSoupFilter.java | 7 +-
.../apache}/stormcrawler/parse/JSoupFilters.java | 10 +-
.../apache}/stormcrawler/parse/Outlink.java | 4 +-
.../apache}/stormcrawler/parse/ParseData.java | 4 +-
.../apache}/stormcrawler/parse/ParseFilter.java | 8 +-
.../apache}/stormcrawler/parse/ParseFilters.java | 8 +-
.../apache}/stormcrawler/parse/ParseResult.java | 4 +-
.../apache}/stormcrawler/parse/TextExtractor.java | 4 +-
.../parse/filter/CollectionTagger.java | 10 +-
.../CommaSeparatedToMultivaluedMetadata.java | 8 +-
.../parse/filter/DebugParseFilter.java | 6 +-
.../parse/filter/DomainParseFilter.java | 12 +-
.../parse/filter/LDJsonParseFilter.java | 10 +-
.../stormcrawler/parse/filter/LinkParseFilter.java | 20 +-
.../parse/filter/MD5SignatureParseFilter.java | 10 +-
.../parse/filter/MimeTypeNormalization.java | 8 +-
.../stormcrawler/parse/filter/XPathFilter.java | 10 +-
.../persistence/AbstractQueryingSpout.java | 9 +-
.../persistence/AbstractStatusUpdaterBolt.java | 10 +-
.../persistence/AdaptiveScheduler.java | 22 +-
.../stormcrawler/persistence/DefaultScheduler.java | 14 +-
.../persistence/EmptyQueueListener.java | 2 +-
.../persistence/MemoryStatusUpdater.java | 6 +-
.../stormcrawler/persistence/Scheduler.java | 8 +-
.../apache}/stormcrawler/persistence/Status.java | 2 +-
.../persistence/StdOutStatusUpdater.java | 4 +-
.../persistence/urlbuffer/AbstractURLBuffer.java | 8 +-
.../persistence/urlbuffer/PriorityURLBuffer.java | 4 +-
.../persistence/urlbuffer/SchedulingURLBuffer.java | 4 +-
.../persistence/urlbuffer/SimpleURLBuffer.java | 2 +-
.../persistence/urlbuffer/URLBuffer.java | 12 +-
.../protocol/AbstractHttpProtocol.java | 118 +----
.../stormcrawler/protocol/DelegatorProtocol.java | 159 ++++--
.../apache}/stormcrawler/protocol/HttpHeaders.java | 2 +-
.../protocol/HttpRobotRulesParser.java | 87 +++-
.../org/apache/stormcrawler/protocol/Protocol.java | 156 ++++++
.../stormcrawler/protocol/ProtocolFactory.java | 17 +-
.../stormcrawler/protocol/ProtocolResponse.java | 10 +-
.../apache}/stormcrawler/protocol/RobotRules.java | 2 +-
.../stormcrawler/protocol/RobotRulesParser.java | 84 +++-
.../stormcrawler/protocol/file/FileProtocol.java | 16 +-
.../stormcrawler/protocol/file/FileResponse.java | 8 +-
.../protocol/httpclient/HttpProtocol.java | 25 +-
.../protocol/okhttp/DNSResolutionListener.java | 2 +-
.../stormcrawler/protocol/okhttp/HttpProtocol.java | 24 +-
.../protocol/selenium/NavigationFilter.java | 8 +-
.../protocol/selenium/NavigationFilters.java | 14 +-
.../protocol/selenium/RemoteDriverProtocol.java | 131 +++++
.../protocol/selenium/SeleniumProtocol.java | 26 +-
.../stormcrawler/proxy/MultiProxyManager.java | 8 +-
.../apache}/stormcrawler/proxy/ProxyManager.java | 4 +-
.../apache}/stormcrawler/proxy/SCProxy.java | 14 +-
.../stormcrawler/proxy/SingleProxyManager.java | 13 +-
.../apache}/stormcrawler/spout/FileSpout.java | 32 +-
.../apache}/stormcrawler/spout/MemorySpout.java | 10 +-
.../stormcrawler/util/AbstractConfigurable.java | 2 +-
.../stormcrawler/util/CharsetIdentification.java | 6 +-
.../stormcrawler/util/CollectionMetric.java | 2 +-
.../apache}/stormcrawler/util/ConfUtils.java | 83 +++-
.../apache}/stormcrawler/util/Configurable.java | 2 +-
.../stormcrawler/util/ConfigurableHelper.java | 2 +-
.../apache}/stormcrawler/util/CookieConverter.java | 2 +-
.../stormcrawler/util/InitialisationUtil.java | 2 +-
.../stormcrawler/util/MetadataTransfer.java | 30 +-
.../stormcrawler/util/PerSecondReducer.java | 2 +-
.../apache}/stormcrawler/util/RefreshTag.java | 2 +-
.../apache}/stormcrawler/util/RobotsTags.java | 4 +-
.../apache}/stormcrawler/util/StringTabScheme.java | 4 +-
.../apache}/stormcrawler/util/URLPartitioner.java | 28 +-
.../stormcrawler/util/URLStreamGrouping.java | 8 +-
.../apache}/stormcrawler/util/URLUtil.java | 2 +-
core/src/main/resources/crawler-default.yaml | 97 +++-
.../apache/stormcrawler/MetadataTest.java} | 26 +-
.../stormcrawler/TestMetadataSerialization.java | 2 +-
.../apache}/stormcrawler/TestOutputCollector.java | 2 +-
.../apache}/stormcrawler/TestUtil.java | 2 +-
.../stormcrawler/bolt/AbstractFetcherBoltTest.java | 12 +-
.../stormcrawler/bolt/FeedParserBoltTest.java | 16 +-
.../apache}/stormcrawler/bolt/FetcherBoltTest.java | 2 +-
.../stormcrawler/bolt/JSoupParserBoltTest.java | 16 +-
.../stormcrawler/bolt/SimpleFetcherBoltTest.java | 2 +-
.../stormcrawler/bolt/SiteMapParserBoltTest.java | 95 ++--
.../stormcrawler/filtering/BasicURLFilterTest.java | 6 +-
.../filtering/BasicURLNormalizerTest.java | 22 +-
.../stormcrawler/filtering/FastURLFilterTest.java | 6 +-
.../stormcrawler/filtering/HostURLFilterTest.java | 6 +-
.../stormcrawler/filtering/MaxDepthFilterTest.java | 8 +-
.../stormcrawler/filtering/MetadataFilterTest.java | 6 +-
.../stormcrawler/filtering/RegexFilterTest.java | 6 +-
.../ClassInheritingFomAbstractAndInterface.java | 6 +-
.../ClassInheritingFromAbstractClassOnly.java | 4 +-
.../ClassInheritingFromOpenClass.java | 4 +-
.../ClassWithoutValidConstructor.java | 4 +-
.../initialisation/FinalClassToInitialize.java | 2 +-
.../helper/initialisation/SimpleOpenClass.java | 2 +-
.../helper/initialisation/base/AbstractClass.java | 2 +-
.../helper/initialisation/base/ITestInterface.java | 2 +-
.../OpenClassWithAbstractClassAndInterface.java | 2 +-
.../stormcrawler/indexer/BasicIndexingTest.java | 27 +-
.../apache}/stormcrawler/indexer/DummyIndexer.java | 6 +-
.../stormcrawler/indexer/IndexerTester.java | 10 +-
.../apache}/stormcrawler/json/JsoupFilterTest.java | 10 +-
.../stormcrawler/jsoup/JSoupFiltersTest.java | 10 +-
.../stormcrawler/parse/DuplicateLinksTest.java | 10 +-
.../apache}/stormcrawler/parse/ParsingTester.java | 8 +-
.../stormcrawler/parse/StackOverflowTest.java | 8 +-
.../stormcrawler/parse/TextExtractorTest.java | 2 +-
.../parse/filter/CSVMetadataFilterTest.java | 8 +-
.../parse/filter/CollectionTaggerTest.java | 4 +-
.../parse/filter/SubDocumentsFilterTest.java | 8 +-
.../parse/filter/SubDocumentsParseFilter.java | 8 +-
.../stormcrawler/parse/filter/XPathFilterTest.java | 8 +-
.../persistence/AdaptiveSchedulerTest.java | 19 +-
.../persistence/DefaultSchedulerTest.java | 4 +-
.../stormcrawler/persistence/URLBufferTest.java | 10 +-
.../protocol/AbstractProtocolTest.java | 96 ++++
.../protocol/DelegationProtocolTest.java | 41 +-
.../stormcrawler/protocol/DummyProtocol.java} | 28 +-
.../stormcrawler/protocol/HttpHeadersTest.java | 2 +-
.../protocol/HttpRobotRulesParserTest.java | 282 +++++++++++
.../protocol/selenium/ProtocolTest.java | 166 +++++++
.../stormcrawler/proxy/MultiProxyManagerTest.java | 2 +-
.../apache}/stormcrawler/proxy/SCProxyTest.java | 2 +-
.../stormcrawler/proxy/SingleProxyManagerTest.java | 2 +-
.../apache/stormcrawler/util/ConfUtilsTest.java | 64 +++
.../stormcrawler/util/CookieConverterTest.java | 2 +-
.../stormcrawler/util/InitialisationUtilTest.java | 6 +-
.../stormcrawler/util/MetadataTransferTest.java | 61 ++-
.../apache}/stormcrawler/util/RefreshTagTest.java | 2 +-
.../apache}/stormcrawler/util/RobotsTagsTest.java | 4 +-
core/src/test/resources/basicurlnormalizer.json | 4 +-
core/src/test/resources/delegator-conf.yaml | 21 +-
core/src/test/resources/test.jsoupfilters.json | 8 +-
core/src/test/resources/test.parsefilters.json | 8 +-
core/src/test/resources/test.subdocfilter.json | 6 +-
.../test/resources/tripadvisor.sitemap.index.xml | 22 +
core/src/test/resources/tripadvisor.sitemap.xml.gz | Bin 0 -> 1537978 bytes
external/aws/README.md | 2 +-
external/aws/pom.xml | 8 +-
.../aws/bolt/CloudSearchConstants.java | 2 +-
.../aws/bolt/CloudSearchIndexerBolt.java | 12 +-
.../stormcrawler/aws/bolt/CloudSearchUtils.java | 2 +-
.../stormcrawler/aws/s3/AbstractS3CacheBolt.java | 4 +-
.../stormcrawler/aws/s3/S3CacheChecker.java | 6 +-
.../apache}/stormcrawler/aws/s3/S3Cacher.java | 6 +-
.../stormcrawler/aws/s3/S3ContentCacher.java | 4 +-
external/elasticsearch/README.md | 20 +-
external/elasticsearch/archetype/pom.xml | 4 +-
.../META-INF/maven/archetype-metadata.xml | 35 +-
.../main/resources/archetype-resources/README.md | 6 +-
.../archetype-resources/crawler-conf.yaml | 58 ++-
.../resources/archetype-resources/es-conf.yaml | 2 +-
.../resources/archetype-resources/es-crawler.flux | 52 +-
.../archetype-resources/es-injection.flux | 50 ++
.../archetype-resources/kibana/importKibana.sh | 8 +-
.../src/main/resources/archetype-resources/pom.xml | 24 +-
.../src/main/java/ESCrawlTopology.java | 36 +-
.../src/main/resources/jsoupfilters.json | 6 +-
.../src/main/resources/parsefilters.json | 8 +-
.../src/main/resources/urlfilters.json | 18 +-
external/elasticsearch/pom.xml | 9 +-
.../BulkItemResponseToFailedFlag.java | 2 +-
.../elasticsearch/ElasticSearchConnection.java | 4 +-
.../elasticsearch/bolt/DeletionBolt.java | 23 +-
.../elasticsearch/bolt/IndexerBolt.java | 20 +-
.../filtering/JSONURLFilterWrapper.java | 14 +-
.../elasticsearch/metrics/MetricsConsumer.java | 8 +-
.../elasticsearch/metrics/StatusMetricsBolt.java | 6 +-
.../parse/filter/JSONResourceWrapper.java | 14 +-
.../elasticsearch/persistence/AbstractSpout.java | 10 +-
.../persistence/AggregationSpout.java | 8 +-
.../elasticsearch/persistence/CollapsingSpout.java | 4 +-
.../elasticsearch/persistence/HybridSpout.java | 6 +-
.../elasticsearch/persistence/ScrollSpout.java | 10 +-
.../persistence/StatusUpdaterBolt.java | 50 +-
.../elasticsearch/bolt/IndexerBoltTest.java | 12 +-
.../elasticsearch/bolt/StatusBoltTest.java | 14 +-
external/langid/pom.xml | 6 +-
.../stormcrawler/parse/filter/LanguageID.java | 12 +-
external/opensearch/OS_IndexInit.sh | 23 -
external/opensearch/README.md | 19 +-
external/opensearch/archetype/pom.xml | 4 +-
.../META-INF/archetype-post-generate.groovy | 5 +-
.../META-INF/maven/archetype-metadata.xml | 37 +-
.../resources/archetype-resources/OS_IndexInit.sh | 25 +
.../main/resources/archetype-resources/README.md | 17 +-
.../archetype-resources/crawler-conf.yaml | 58 ++-
.../resources/archetype-resources/crawler.flux | 50 +-
.../dashboards/importDashboards.sh | 8 +-
.../resources/archetype-resources/injection.flux | 50 ++
.../archetype-resources/opensearch-conf.yaml | 12 +-
.../src/main/resources/archetype-resources/pom.xml | 24 +-
.../src/main/resources/indexer.mapping} | 0
.../src/main/resources/jsoupfilters.json | 6 +-
.../src/main/resources/metrics.mapping | 0
.../src/main/resources/parsefilters.json | 8 +-
.../src/main/resources/status.mapping | 0
.../src/main/resources/urlfilters.json | 18 +-
external/opensearch/opensearch-conf.yaml | 12 +-
external/opensearch/pom.xml | 24 +-
.../stormcrawler/opensearch/bolt/DeletionBolt.java | 94 ----
.../opensearch/BulkItemResponseToFailedFlag.java | 10 +-
.../apache}/stormcrawler/opensearch/Constants.java | 2 +-
.../stormcrawler/opensearch/IndexCreation.java | 15 +-
.../opensearch/OpenSearchConnection.java} | 102 ++--
.../stormcrawler/opensearch/bolt/DeletionBolt.java | 308 ++++++++++++
.../stormcrawler/opensearch/bolt/IndexerBolt.java | 59 +--
.../opensearch/filtering/JSONURLFilterWrapper.java | 16 +-
.../opensearch/metrics/MetricsConsumer.java | 26 +-
.../opensearch/metrics/StatusMetricsBolt.java | 20 +-
.../parse/filter/JSONResourceWrapper.java | 38 +-
.../opensearch/persistence/AbstractSpout.java | 78 +--
.../opensearch/persistence/AggregationSpout.java | 18 +-
.../opensearch/persistence/HybridSpout.java | 22 +-
.../opensearch/persistence/StatusUpdaterBolt.java | 81 +--
.../opensearch/bolt/AbstractOpenSearchTest.java | 46 ++
.../opensearch/bolt/IndexerBoltTest.java | 30 +-
.../opensearch/bolt/StatusBoltTest.java | 38 +-
.../resources/indexer.mapping} | 0
.../src/{main => test}/resources/metrics.mapping | 0
.../src/test/resources/status.mapping | 0
external/pom.xml | 23 +-
external/solr/README.md | 2 +-
external/solr/cores/status/conf/schema.xml | 2 +-
external/solr/pom.xml | 14 +-
external/solr/solr-conf.yaml | 2 +-
.../apache}/stormcrawler/solr/SeedInjector.java | 10 +-
.../apache}/stormcrawler/solr/SolrConnection.java | 4 +-
.../stormcrawler/solr/SolrCrawlTopology.java | 26 +-
.../stormcrawler/solr/bolt/DeletionBolt.java | 86 ++++
.../stormcrawler/solr/bolt/IndexerBolt.java | 13 +-
.../stormcrawler/solr/metrics/MetricsConsumer.java | 6 +-
.../stormcrawler/solr/persistence/SolrSpout.java | 11 +-
.../solr/persistence/StatusUpdaterBolt.java | 17 +-
.../solr/persistence/StatusBoltTest.java | 12 +-
external/sql/pom.xml | 6 +-
external/sql/sql-conf.yaml | 2 +-
.../apache}/stormcrawler/sql/Constants.java | 2 +-
.../apache}/stormcrawler/sql/IndexerBolt.java | 12 +-
.../apache}/stormcrawler/sql/SQLSpout.java | 10 +-
.../apache}/stormcrawler/sql/SQLUtil.java | 2 +-
.../stormcrawler/sql/StatusUpdaterBolt.java | 33 +-
.../stormcrawler/sql/metrics/MetricsConsumer.java | 18 +-
external/tika/README.md | 4 +-
external/tika/pom.xml | 12 +-
.../apache}/stormcrawler/tika/DOMBuilder.java | 2 +-
.../apache}/stormcrawler/tika/ParserBolt.java | 38 +-
.../apache}/stormcrawler/tika/RedirectionBolt.java | 4 +-
.../stormcrawler/tika/XMLCharacterRecognizer.java | 2 +-
.../apache}/stormcrawler/tika/ParserBoltTest.java | 16 +-
external/urlfrontier/README.md | 2 +-
external/urlfrontier/pom.xml | 9 +-
.../stormcrawler/urlfrontier/Constants.java | 2 +-
.../urlfrontier/ManagedChannelUtil.java | 4 +-
.../apache}/stormcrawler/urlfrontier/Spout.java | 10 +-
.../urlfrontier/StatusUpdaterBolt.java | 14 +-
.../urlfrontier/StatusUpdaterBoltTest.java | 16 +-
.../urlfrontier/URLFrontierContainer.java | 2 +-
.../urlfrontier/URLFrontierContainerConfig.java | 2 +-
external/warc/README.md | 43 +-
external/warc/pom.xml | 20 +-
.../warc/FileTimeSizeRotationPolicy.java | 2 +-
.../apache}/stormcrawler/warc/GzipHdfsBolt.java | 2 +-
.../stormcrawler/warc/WARCFileNameFormat.java | 2 +-
.../apache}/stormcrawler/warc/WARCHdfsBolt.java | 6 +-
.../stormcrawler/warc/WARCRecordFormat.java | 20 +-
.../stormcrawler/warc/WARCRequestRecordFormat.java | 8 +-
.../apache}/stormcrawler/warc/WARCSpout.java | 65 ++-
.../stormcrawler/warc/WARCHdfsBoltTest.java | 10 +-
.../stormcrawler/warc/WARCRecordFormatTest.java | 8 +-
.../apache/stormcrawler/warc/WARCSpoutTest.java | 70 +++
external/warc/src/test/resources/test.warc.gz | Bin 0 -> 301243 bytes
.../src/test/resources/unparsable-date.warc.gz | Bin 0 -> 938 bytes
external/warc/src/test/resources/warc.inputs | 2 +
pom.xml | 264 ++++++++--
330 files changed, 5219 insertions(+), 2381 deletions(-)
create mode 100644 .github/workflows/code_coverage.yml
create mode 100644 DISCLAIMER
create mode 100644 THIRD-PARTY.properties
create mode 100644 THIRD-PARTY.txt
delete mode 100644
core/src/main/java/com/digitalpebble/stormcrawler/protocol/Protocol.java
delete mode 100644
core/src/main/java/com/digitalpebble/stormcrawler/protocol/selenium/DelegatorRemoteDriverProtocol.java
delete mode 100644
core/src/main/java/com/digitalpebble/stormcrawler/protocol/selenium/RemoteDriverProtocol.java
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/ConfigurableTopology.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/Constants.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/JSONResource.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/Metadata.java (92%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/FeedParserBolt.java (93%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/FetcherBolt.java (91%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/JSoupParserBolt.java (93%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/SimpleFetcherBolt.java (93%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/SiteMapParserBolt.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/StatusEmitterBolt.java (83%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/URLFilterBolt.java (91%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/URLPartitionerBolt.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/URLFilter.java (91%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/URLFilters.java (60%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/basic/BasicURLFilter.java (94%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/basic/BasicURLNormalizer.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/basic/SelfURLFilter.java (90%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/depth/MaxDepthFilter.java (92%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/host/HostURLFilter.java (96%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/metadata/MetadataFilter.java (94%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/regex/FastURLFilter.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/regex/RegexRule.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/regex/RegexURLFilter.java (96%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/regex/RegexURLFilterBase.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/regex/RegexURLNormalizer.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/robots/RobotsFilter.java (86%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/sitemap/SitemapFilter.java (86%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/indexing/AbstractIndexerBolt.java (70%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/indexing/DummyIndexer.java (89%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/indexing/StdOutIndexer.java (93%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/jsoup/LDJsonParseFilter.java (91%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/jsoup/LinkParseFilter.java (89%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/jsoup/XPathFilter.java (92%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/DocumentFragmentBuilder.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/JSoupFilter.java (87%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/JSoupFilters.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/Outlink.java (94%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/ParseData.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/ParseFilter.java (88%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/ParseFilters.java (96%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/ParseResult.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/TextExtractor.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/CollectionTagger.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/CommaSeparatedToMultivaluedMetadata.java
(91%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/DebugParseFilter.java (92%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/DomainParseFilter.java (86%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/LDJsonParseFilter.java (93%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/LinkParseFilter.java (89%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/MD5SignatureParseFilter.java (92%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/MimeTypeNormalization.java (91%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/XPathFilter.java (96%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/AbstractQueryingSpout.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/AbstractStatusUpdaterBolt.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/AdaptiveScheduler.java (94%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/DefaultScheduler.java (94%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/EmptyQueueListener.java (94%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/MemoryStatusUpdater.java (90%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/Scheduler.java (90%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/Status.java (96%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/StdOutStatusUpdater.java (94%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/urlbuffer/AbstractURLBuffer.java (93%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/urlbuffer/PriorityURLBuffer.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/urlbuffer/SchedulingURLBuffer.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/urlbuffer/SimpleURLBuffer.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/urlbuffer/URLBuffer.java (90%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/AbstractHttpProtocol.java (59%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/DelegatorProtocol.java (54%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/HttpHeaders.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/HttpRobotRulesParser.java (71%)
create mode 100644
core/src/main/java/org/apache/stormcrawler/protocol/Protocol.java
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/ProtocolFactory.java (87%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/ProtocolResponse.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/RobotRules.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/RobotRulesParser.java (66%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/file/FileProtocol.java (78%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/file/FileResponse.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/httpclient/HttpProtocol.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/okhttp/DNSResolutionListener.java (96%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/okhttp/HttpProtocol.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/selenium/NavigationFilter.java (83%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/selenium/NavigationFilters.java (90%)
create mode 100644
core/src/main/java/org/apache/stormcrawler/protocol/selenium/RemoteDriverProtocol.java
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/selenium/SeleniumProtocol.java (78%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/proxy/MultiProxyManager.java (97%)
copy core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/proxy/ProxyManager.java (91%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/proxy/SCProxy.java (93%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/proxy/SingleProxyManager.java (85%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/spout/FileSpout.java (88%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/spout/MemorySpout.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/AbstractConfigurable.java (96%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/CharsetIdentification.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/CollectionMetric.java (96%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/ConfUtils.java (56%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/Configurable.java (99%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/ConfigurableHelper.java (99%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/CookieConverter.java (99%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/InitialisationUtil.java (99%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/MetadataTransfer.java (87%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/PerSecondReducer.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/RefreshTag.java (97%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/RobotsTags.java (98%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/StringTabScheme.java (95%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/URLPartitioner.java (81%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/URLStreamGrouping.java (94%)
rename core/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/URLUtil.java (99%)
copy
core/src/test/java/{com/digitalpebble/stormcrawler/protocol/HttpHeadersTest.java
=> org/apache/stormcrawler/MetadataTest.java} (57%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/TestMetadataSerialization.java (98%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/TestOutputCollector.java (98%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/TestUtil.java (99%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/AbstractFetcherBoltTest.java (92%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/FeedParserBoltTest.java (90%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/FetcherBoltTest.java (95%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/JSoupParserBoltTest.java (96%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/SimpleFetcherBoltTest.java (95%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/bolt/SiteMapParserBoltTest.java (85%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/BasicURLFilterTest.java (94%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/BasicURLNormalizerTest.java (93%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/FastURLFilterTest.java (94%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/HostURLFilterTest.java (96%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/MaxDepthFilterTest.java (93%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/MetadataFilterTest.java (94%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/filtering/RegexFilterTest.java (95%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/ClassInheritingFomAbstractAndInterface.java
(80%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/ClassInheritingFromAbstractClassOnly.java
(85%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/ClassInheritingFromOpenClass.java
(84%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/ClassWithoutValidConstructor.java
(86%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/FinalClassToInitialize.java (93%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/SimpleOpenClass.java (92%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/base/AbstractClass.java (94%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/base/ITestInterface.java (92%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/helper/initialisation/base/OpenClassWithAbstractClassAndInterface.java
(93%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/indexer/BasicIndexingTest.java (89%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/indexer/DummyIndexer.java (94%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/indexer/IndexerTester.java (89%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/json/JsoupFilterTest.java (90%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/jsoup/JSoupFiltersTest.java (93%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/DuplicateLinksTest.java (87%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/ParsingTester.java (93%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/StackOverflowTest.java (91%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/TextExtractorTest.java (98%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/CSVMetadataFilterTest.java (88%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/CollectionTaggerTest.java (91%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/SubDocumentsFilterTest.java (87%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/SubDocumentsParseFilter.java (92%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/XPathFilterTest.java (93%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/AdaptiveSchedulerTest.java (92%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/DefaultSchedulerTest.java (97%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/persistence/URLBufferTest.java (90%)
create mode 100644
core/src/test/java/org/apache/stormcrawler/protocol/AbstractProtocolTest.java
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/DelegationProtocolTest.java (69%)
rename
core/src/{main/java/com/digitalpebble/stormcrawler/proxy/ProxyManager.java =>
test/java/org/apache/stormcrawler/protocol/DummyProtocol.java} (61%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/protocol/HttpHeadersTest.java (96%)
create mode 100644
core/src/test/java/org/apache/stormcrawler/protocol/HttpRobotRulesParserTest.java
create mode 100644
core/src/test/java/org/apache/stormcrawler/protocol/selenium/ProtocolTest.java
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/proxy/MultiProxyManagerTest.java (99%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/proxy/SCProxyTest.java (98%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/proxy/SingleProxyManagerTest.java (97%)
create mode 100644
core/src/test/java/org/apache/stormcrawler/util/ConfUtilsTest.java
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/CookieConverterTest.java (99%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/InitialisationUtilTest.java (97%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/MetadataTransferTest.java (50%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/RefreshTagTest.java (97%)
rename core/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/util/RobotsTagsTest.java (95%)
create mode 100644 core/src/test/resources/tripadvisor.sitemap.index.xml
create mode 100644 core/src/test/resources/tripadvisor.sitemap.xml.gz
rename external/aws/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/aws/bolt/CloudSearchConstants.java (96%)
rename external/aws/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java (97%)
rename external/aws/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/aws/bolt/CloudSearchUtils.java (98%)
rename external/aws/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/aws/s3/AbstractS3CacheBolt.java (96%)
rename external/aws/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/aws/s3/S3CacheChecker.java (96%)
rename external/aws/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/aws/s3/S3Cacher.java (97%)
rename external/aws/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/aws/s3/S3ContentCacher.java (94%)
create mode 100644
external/elasticsearch/archetype/src/main/resources/archetype-resources/es-injection.flux
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/BulkItemResponseToFailedFlag.java (98%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/ElasticSearchConnection.java (99%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/bolt/DeletionBolt.java (82%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/bolt/IndexerBolt.java (96%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/filtering/JSONURLFilterWrapper.java (93%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/metrics/MetricsConsumer.java (95%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/metrics/StatusMetricsBolt.java (96%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/parse/filter/JSONResourceWrapper.java
(92%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/persistence/AbstractSpout.java (96%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/persistence/AggregationSpout.java (98%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/persistence/CollapsingSpout.java (98%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/persistence/HybridSpout.java (97%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/persistence/ScrollSpout.java (95%)
rename external/elasticsearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/persistence/StatusUpdaterBolt.java (90%)
rename external/elasticsearch/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/bolt/IndexerBoltTest.java (94%)
rename external/elasticsearch/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/elasticsearch/bolt/StatusBoltTest.java (94%)
rename external/langid/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/parse/filter/LanguageID.java (93%)
delete mode 100755 external/opensearch/OS_IndexInit.sh
create mode 100755
external/opensearch/archetype/src/main/resources/archetype-resources/OS_IndexInit.sh
create mode 100644
external/opensearch/archetype/src/main/resources/archetype-resources/injection.flux
copy external/opensearch/{src/main/resources/content.mapping =>
archetype/src/main/resources/archetype-resources/src/main/resources/indexer.mapping}
(100%)
copy external/opensearch/{ =>
archetype/src/main/resources/archetype-resources}/src/main/resources/metrics.mapping
(100%)
rename external/opensearch/{ =>
archetype/src/main/resources/archetype-resources}/src/main/resources/status.mapping
(100%)
delete mode 100644
external/opensearch/src/main/java/com/digitalpebble/stormcrawler/opensearch/bolt/DeletionBolt.java
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/BulkItemResponseToFailedFlag.java (91%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/Constants.java (94%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/IndexCreation.java (89%)
rename
external/opensearch/src/main/java/{com/digitalpebble/stormcrawler/opensearch/OpensearchConnection.java
=> org/apache/stormcrawler/opensearch/OpenSearchConnection.java} (75%)
create mode 100644
external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/DeletionBolt.java
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/bolt/IndexerBolt.java (90%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/filtering/JSONURLFilterWrapper.java (92%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/metrics/MetricsConsumer.java (87%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/metrics/StatusMetricsBolt.java (89%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/parse/filter/JSONResourceWrapper.java (82%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/persistence/AbstractSpout.java (75%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/persistence/AggregationSpout.java (96%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/persistence/HybridSpout.java (90%)
rename external/opensearch/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java (85%)
create mode 100644
external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/AbstractOpenSearchTest.java
rename external/opensearch/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/bolt/IndexerBoltTest.java (83%)
rename external/opensearch/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/opensearch/bolt/StatusBoltTest.java (79%)
rename external/opensearch/src/{main/resources/content.mapping =>
test/resources/indexer.mapping} (100%)
rename external/opensearch/src/{main => test}/resources/metrics.mapping (100%)
copy external/{elasticsearch => opensearch}/src/test/resources/status.mapping
(100%)
rename external/solr/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/solr/SeedInjector.java (86%)
rename external/solr/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/solr/SolrConnection.java (97%)
rename external/solr/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/solr/SolrCrawlTopology.java (74%)
create mode 100644
external/solr/src/main/java/org/apache/stormcrawler/solr/bolt/DeletionBolt.java
rename external/solr/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/solr/bolt/IndexerBolt.java (92%)
rename external/solr/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/solr/metrics/MetricsConsumer.java (96%)
rename external/solr/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/solr/persistence/SolrSpout.java (96%)
rename external/solr/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/solr/persistence/StatusUpdaterBolt.java (88%)
rename external/solr/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/solr/persistence/StatusBoltTest.java (93%)
rename external/sql/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/sql/Constants.java (96%)
rename external/sql/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/sql/IndexerBolt.java (94%)
rename external/sql/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/sql/SQLSpout.java (96%)
rename external/sql/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/sql/SQLUtil.java (97%)
rename external/sql/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/sql/StatusUpdaterBolt.java (89%)
rename external/sql/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/sql/metrics/MetricsConsumer.java (92%)
rename external/tika/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/tika/DOMBuilder.java (99%)
rename external/tika/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/tika/ParserBolt.java (93%)
rename external/tika/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/tika/RedirectionBolt.java (97%)
rename external/tika/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/tika/XMLCharacterRecognizer.java (98%)
rename external/tika/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/tika/ParserBoltTest.java (90%)
rename external/urlfrontier/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/urlfrontier/Constants.java (97%)
rename external/urlfrontier/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/urlfrontier/ManagedChannelUtil.java (93%)
rename external/urlfrontier/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/urlfrontier/Spout.java (96%)
rename external/urlfrontier/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/urlfrontier/StatusUpdaterBolt.java (97%)
rename external/urlfrontier/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/urlfrontier/StatusUpdaterBoltTest.java (92%)
rename external/urlfrontier/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/urlfrontier/URLFrontierContainer.java (98%)
rename external/urlfrontier/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/urlfrontier/URLFrontierContainerConfig.java (95%)
rename external/warc/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/FileTimeSizeRotationPolicy.java (98%)
rename external/warc/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/GzipHdfsBolt.java (99%)
rename external/warc/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/WARCFileNameFormat.java (98%)
rename external/warc/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/WARCHdfsBolt.java (95%)
rename external/warc/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/WARCRecordFormat.java (96%)
rename external/warc/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/WARCRequestRecordFormat.java (96%)
rename external/warc/src/main/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/WARCSpout.java (92%)
rename external/warc/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/WARCHdfsBoltTest.java (97%)
rename external/warc/src/test/java/{com/digitalpebble =>
org/apache}/stormcrawler/warc/WARCRecordFormatTest.java (98%)
create mode 100644
external/warc/src/test/java/org/apache/stormcrawler/warc/WARCSpoutTest.java
create mode 100644 external/warc/src/test/resources/test.warc.gz
create mode 100644 external/warc/src/test/resources/unparsable-date.warc.gz
create mode 100644 external/warc/src/test/resources/warc.inputs