lewismc commented on PR #825: URL: https://github.com/apache/nutch/pull/825#issuecomment-3776075746
The most recent updates address a field duplication issue that could occur when chaining multiple GeoIP databases. Here's an example of running `indexchecker`:

```
./runtime/local/bin/nutch indexchecker https://nutch.apache.org
...
accuracyRadius : 1000
isPublicProxy : false
countryIsoCode : US
cityNetworkAddress : 151.101.0.0/21
countryNetworkAddress : 151.101.0.0/21
countryGeoNameId : 6252001
autonomousSystemNumber : 54113
title : Apache Nutch™
content : Apache Nutch™ Apache Nutch™ Apache Nutch™ Community Development Docs Download News The Apache Softwa
isHostingProvider : false
isTorExitNode : false
digest : 09f55cdd88bb9a668023f96143ec9605
host : nutch.apache.org
id : https://nutch.apache.org
isAnycast : false
continentCode : NA
isLegitimateProxy : false
ip : 151.101.2.132
timeZone : America/Chicago
isAnonymousVpn : false
isResidentialProxy : false
autonomousSystemOrganization : FASTLY
url : https://nutch.apache.org
isAnonymous : false
tstamp : Tue Jan 20 20:21:34 PST 2026
latLon : 37.751,-97.822
countryInEuropeanUnion : false
continentGeoNameId : 6255149
countryName : United States
continentName : North America
asnNetworkAddress : 151.101.0.0/16
```

Required configuration:

```
<property>
  <name>store.ip.address</name>
  <value>true</value>
  <description>Enables us to capture the specific IP address
  (InetSocketAddress) of the host which we connect to via the given
  protocol. Currently supported by: protocol-ftp, protocol-http,
  protocol-okhttp, protocol-htmlunit, protocol-selenium. Note that the
  IP address is required by the plugin index-geoip and when writing
  WARC files.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|geoip)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr. More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
  <name>index.geoip.db.asn</name>
  <value>GeoLite2-ASN.mmdb</value>
  <description>
  GeoIP2/GeoLite2 ASN database file (MMDB format).
  Provides autonomous system number and organization information.
  </description>
</property>

<property>
  <name>index.geoip.db.city</name>
  <value>GeoLite2-City.mmdb</value>
  <description>
  GeoIP2/GeoLite2 City database file (MMDB format).
  Provides city, subdivision, country, continent, and location data.
  </description>
</property>

<property>
  <name>index.geoip.db.country</name>
  <value>GeoLite2-Country.mmdb</value>
  <description>
  GeoIP2/GeoLite2 Country database file (MMDB format).
  Provides country, continent, and represented country information.
  This is a lighter-weight alternative to the City database when only
  country-level information is needed.
  </description>
</property>
```
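For anyone who wants to check the `indexchecker` output for repeated fields after chaining the ASN, City, and Country databases, here is a minimal sketch in Python. It is not part of the PR. It assumes each field is printed on its own line as `name : value` (colon surrounded by whitespace, as in the pasted output), and the helper names `parse_fields` and `duplicated_fields` are made up for illustration. Note that a repeated field name is not necessarily a bug, since some fields may legitimately be multi-valued.

```python
import re
from collections import Counter

# Assumed line shape: "<name> : <value>", with whitespace on both sides of
# the first colon. Lines that do not match (e.g. "...") are ignored.
FIELD_SEP = re.compile(r"\s+:\s+")


def parse_fields(output: str) -> list[tuple[str, str]]:
    """Split indexchecker-style output into (name, value) pairs."""
    pairs = []
    for line in output.splitlines():
        parts = FIELD_SEP.split(line.strip(), maxsplit=1)
        if len(parts) == 2 and parts[0]:
            pairs.append((parts[0], parts[1]))
    return pairs


def duplicated_fields(output: str) -> dict[str, int]:
    """Return the field names that appear more than once, with their counts."""
    counts = Counter(name for name, _ in parse_fields(output))
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical sample showing what a duplicated field would look like.
sample = """\
countryIsoCode : US
countryIsoCode : US
asnNetworkAddress : 151.101.0.0/16
url : https://nutch.apache.org
"""

print(duplicated_fields(sample))  # {'countryIsoCode': 2}
```

Run against the output pasted above, this should report no duplicates, since every field name there appears exactly once.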

