[
https://issues.apache.org/jira/browse/NUTCH-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051130#comment-18051130
]
ASF GitHub Bot commented on NUTCH-3064:
---------------------------------------
lewismc commented on PR #825:
URL: https://github.com/apache/nutch/pull/825#issuecomment-3736568735
This PR now upgrades the `index-geoip` plugin to use MaxMind GeoIP2 Java API
5.0.2, with significant architectural improvements including support for
multiple database types and in-memory caching.
## Changes
### Dependency Updates
- `geoip2`: upgraded to **5.0.2**
- `maxmind-db`: upgraded to **4.0.2**
- `jackson-datatype-jsr310`: added **2.20.1** (new transitive dependency)
### Performance Improvement — CHMCache
Database readers now use `CHMCache` (ConcurrentHashMap Cache) from the
maxmind-db library for improved lookup performance:
```java
DatabaseReader reader = new DatabaseReader.Builder(db)
    .withCache(new CHMCache())
    .build();
```
This caches parsed database nodes in memory, reducing disk I/O and improving
throughput when the same IP prefixes are queried repeatedly during indexing.
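The effect of such a cache can be illustrated with a small, self-contained sketch. This is not MaxMind's implementation: `OffsetCache`, its integer offset key, and the `decoder` callback are hypothetical stand-ins, used only to show how a `ConcurrentHashMap` lets repeated lookups skip the decode step.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

/** Hypothetical stand-in for a decode cache; assumes records are keyed by an int offset. */
class OffsetCache<V> {
  private final Map<Integer, V> entries = new ConcurrentHashMap<>();

  /** Returns the cached value for an offset, running the decoder only on first access. */
  V get(int offset, IntFunction<V> decoder) {
    return entries.computeIfAbsent(offset, decoder::apply);
  }

  int size() {
    return entries.size();
  }
}

class OffsetCacheDemo {
  public static void main(String[] args) {
    AtomicInteger decodes = new AtomicInteger();
    OffsetCache<String> cache = new OffsetCache<>();
    IntFunction<String> decoder = offset -> {
      decodes.incrementAndGet(); // counts simulated (expensive) decodes
      return "record@" + offset;
    };
    cache.get(1024, decoder);
    cache.get(1024, decoder); // same offset: served from the cache
    cache.get(2048, decoder);
    System.out.println(decodes.get() + " decodes for 3 lookups"); // prints "2 decodes for 3 lookups"
  }
}
```

The saving grows with how often the same records are hit, which is the case described above when many URLs resolve to addresses in the same networks.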
### New Configuration Options in `conf/nutch-default.xml`
The plugin now supports multiple database types simultaneously. Configure
each by setting its file path:
| Property | Description |
|----------|-------------|
| `index.geoip.db.anonymous` | Anonymous IP database: identifies VPNs, proxies, Tor exit nodes |
| `index.geoip.db.asn` | ASN database: autonomous system number and organization |
| `index.geoip.db.city` | City database: city, subdivision, country, continent, coordinates |
| `index.geoip.db.connection` | Connection Type database: Cable/DSL, Cellular, Corporate, Satellite |
| `index.geoip.db.domain` | Domain database: second-level domain for the IP |
| `index.geoip.db.isp` | ISP database: ISP name, organization, ASN |
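One way the properties above might be resolved into a set of enabled databases is sketched below. It rests on several assumptions: `java.util.Properties` stands in for the Hadoop `Configuration` that Nutch actually uses, the `GeoDbType` enum and `GeoDbConfig` helper are hypothetical names, and treating an unset or blank property as "database disabled" is a guess about the plugin's behaviour, not something stated in this PR.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

/** Hypothetical enum whose constants mirror the property suffixes in the table above. */
enum GeoDbType {
  ANONYMOUS, ASN, CITY, CONNECTION, DOMAIN, ISP;

  /** Builds the property key, e.g. CITY -> index.geoip.db.city */
  String propertyKey() {
    return "index.geoip.db." + name().toLowerCase(Locale.ROOT);
  }
}

class GeoDbConfig {
  /** Collects configured database paths; unset or blank properties are skipped (assumption). */
  static Map<GeoDbType, String> configuredPaths(Properties conf) {
    Map<GeoDbType, String> paths = new EnumMap<>(GeoDbType.class);
    for (GeoDbType type : GeoDbType.values()) {
      String path = conf.getProperty(type.propertyKey(), "").trim();
      if (!path.isEmpty()) {
        paths.put(type, path);
      }
    }
    return paths;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Illustrative path only; point this at wherever the .mmdb file really lives.
    conf.setProperty("index.geoip.db.city", "/data/geoip/city.mmdb");
    System.out.println(configuredPaths(conf).keySet()); // prints "[CITY]"
  }
}
```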
### MaxMind Insights Web Service Support
| Property | Description |
|----------|-------------|
| `index.geoip.insights.userid` | User ID for MaxMind Precision Insights API |
| `index.geoip.insights.licensekey` | License key for the Insights API |
### Architecture Improvements
- Refactored to support multiple databases via `EnumMap<DatabaseType, DatabaseReader>`
- Each database type is loaded independently and queried in sequence
- Proper resource cleanup via `Closeable` implementation
- Graceful error handling per-database (one failure doesn't block others)
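
The pattern described by these bullets can be sketched roughly as follows. This is not the plugin's code: `DbKind`, `IpReader`, and `GeoLookup` are hypothetical names, and `IpReader` is a minimal stand-in for a real `DatabaseReader` so that the example runs without the GeoIP2 library.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical subset of database types, for illustration only. */
enum DbKind { ASN, CITY, ISP }

/** Hypothetical minimal reader interface standing in for a GeoIP2 DatabaseReader. */
interface IpReader extends Closeable {
  String lookup(String ip) throws IOException;

  @Override
  default void close() throws IOException {
    // Nothing to release by default.
  }
}

class GeoLookup implements Closeable {
  private final Map<DbKind, IpReader> readers = new EnumMap<>(DbKind.class);

  void register(DbKind kind, IpReader reader) {
    readers.put(kind, reader);
  }

  /** Queries each reader in enum order; a failing reader is logged and skipped. */
  Map<DbKind, String> lookupAll(String ip) {
    Map<DbKind, String> results = new EnumMap<>(DbKind.class);
    for (Map.Entry<DbKind, IpReader> entry : readers.entrySet()) {
      try {
        results.put(entry.getKey(), entry.getValue().lookup(ip));
      } catch (IOException | RuntimeException ex) {
        System.err.println("Skipping " + entry.getKey() + ": " + ex.getMessage());
      }
    }
    return results;
  }

  /** Closes every reader, rethrowing the first failure only after all were attempted. */
  @Override
  public void close() throws IOException {
    IOException first = null;
    for (IpReader reader : readers.values()) {
      try {
        reader.close();
      } catch (IOException ex) {
        if (first == null) {
          first = ex;
        }
      }
    }
    readers.clear();
    if (first != null) {
      throw first;
    }
  }
}
```

Because `GeoLookup` implements `Closeable`, it can be used in a try-with-resources block, and the `EnumMap` keeps iteration order fixed to the enum's declaration order, so lookups run in a predictable sequence.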
## Files Modified
- `src/plugin/index-geoip/` — plugin source, tests, dependencies, and config
- `build.xml` — root build configuration
- `conf/nutch-default.xml` — new GeoIP configuration properties
- `src/plugin/build.xml` — plugin build configuration
- `src/plugin/indexer-solr/schema.xml` — Solr schema field definitions
> Upgrade index-geoip to GeoIP2 5.0.2
> -----------------------------------
>
> Key: NUTCH-3064
> URL: https://issues.apache.org/jira/browse/NUTCH-3064
> Project: Nutch
> Issue Type: Task
> Components: index-geoip, plugin
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.22
>
>
> A recent mailing list question about the index-geoip plugin prompted me to
> take a look at it and perform any necessary maintenance.
> As of writing, the latest release of the dependency is v4.2.0; see
> [https://central.sonatype.com/artifact/com.maxmind.geoip2/geoip2].
> At a minimum this ticket will accomplish the dependency update. I'll also
> have a look at documentation and maybe provide some unit tests... which I
> neglected to furnish last time around.