[ 
https://issues.apache.org/jira/browse/NUTCH-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881427#comment-17881427
 ] 

ASF GitHub Bot commented on NUTCH-3064:
---------------------------------------

lewismc opened a new pull request, #825:
URL: https://github.com/apache/nutch/pull/825

   **Work in Progress**
   
   This PR begins to address 
[NUTCH-3064](https://issues.apache.org/jira/browse/NUTCH-3064) by performing 
the upgrade of the com.maxmind.geoip2:geoip2 dependency to v4.2.0. It has not 
yet been tested in a distributed Nutch deployment. Although no additional 
dependencies have been added, I still want to test a full deployment.
   
   In addition to the proposed upgrade I performed some refactoring which I 
considered to be improvements.
   
   # Refactoring/Improvements
   
   1. Establishes unit test(s). I have more work to do here to accommodate the 
change in logic for loading the maxmind db file(s) from the class path.
   2. Removes duplication of configuration documentation, including it only in 
`nutch-default.xml`.
   3. Removes `insightsService` as the default value for the 
`index.geoip.usage` configuration property.
   4. Introduces a new property `index.geoip.db.file` which allows specifying 
the Maxmind DB file packaged with the Nutch `.job` file.
   5. Adds Javadoc to every Class and Method of the index-geoip plugin (more 
work to be done here).
   6. Uses the [updated GeoIP Database 
guidance](https://github.com/maxmind/GeoIP2-java/blob/main/README.md#database-usage),
 specifically
     - Using the `try` methods; "...If you are looking up many IPs that are not 
contained in the database, the try method will be slightly faster as they do 
not need to construct and throw an exception."
     - Uses [DB 
Caching](https://github.com/maxmind/GeoIP2-java/blob/main/README.md#caching); 
"... Using this cache, lookup performance is significantly improved at the cost 
of a small (~2MB) memory overhead."
   7. Updates the set of fields available for each database, as new fields 
have been added to the Java API since I first wrote this plugin.
   8. Simplifies the values available for the `index.geoip.usage` configuration 
property. Available values are now `anonymous`, `asn`, `city`, `connection`, 
`domain`, `insights` or `isp`. **THIS IS A BACKWARDS INCOMPATIBLE BREAKING 
CHANGE** which we would need to call out in the release notes. I decided to 
implement this change [based on recent 
feedback](https://lists.apache.org/thread/1bpwqs1b890pcog8zsqgqyy1mq31tp9n) 
which I agree with btw.
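
As a rough illustration of item 8 (this is not code from the PR), the simplified `index.geoip.usage` values could be validated along the following lines. The class name `GeoIpUsage`, the `parse` helper, and the case-insensitive matching are all assumptions made for this sketch; the plugin itself may handle the property differently.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration only; not taken from the Nutch source. */
public class GeoIpUsage {

  /** The values listed for index.geoip.usage in this PR description. */
  public enum Usage { ANONYMOUS, ASN, CITY, CONNECTION, DOMAIN, INSIGHTS, ISP }

  /**
   * Maps a configured string onto a Usage value.
   * Returns empty when the value is unset or not one of the listed values.
   * Trimming and case-insensitive matching are assumptions of this sketch.
   */
  public static Optional<Usage> parse(String configured) {
    if (configured == null) {
      return Optional.empty();
    }
    String normalised = configured.trim().toUpperCase(Locale.ROOT);
    return Arrays.stream(Usage.values())
        .filter(u -> u.name().equals(normalised))
        .findFirst();
  }

  public static void main(String[] args) {
    System.out.println(parse("city"));            // prints Optional[CITY]
    System.out.println(parse("insightsService")); // prints Optional.empty
  }
}
```

In this sketch an old-style value such as `insightsService` simply fails to match, which is one way the backwards-incompatible nature of the change could surface to users who have not updated their configuration.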
   
   # Future work
   
   I can anticipate a use case where multiple [Maxmind 
DBs](https://www.maxmind.com/en/geoip-databases) and/or [Web service 
lookups](https://www.maxmind.com/en/geoip-api-web-services) might be 
_chained_ together, with the results aggregated within one 
`NutchDocument`. I did not wish to complicate this PR any further, so any 
implementation will be described first in another Jira ticket.




> Upgrade com.maxmind.geoip2:geoip2 dependency in geoip-index to v4.2.0
> ---------------------------------------------------------------------
>
>                 Key: NUTCH-3064
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3064
>             Project: Nutch
>          Issue Type: Task
>          Components: index-geoip, plugin
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.21
>
>
> A recent mailing list question about the index-geoip plugin prompted me to 
> take a look at it and perform any necessary maintenance. 
> As of writing, the latest dependency can be found at 
> [https://central.sonatype.com/artifact/com.maxmind.geoip2/geoip2] at v4.2.0.
> At a minimum this ticket will accomplish the dependency update. I'll also 
> have a look at documentation and maybe provide some unit tests... which I 
> neglected to furnish last time around.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
