jornfranke opened a new issue, #2360:
URL: https://github.com/apache/sedona/issues/2360

   Thank you for the great integration of libpostal described in 
https://github.com/apache/sedona/issues/2074
   
   I have the following enhancement proposal to make it more usable in an 
enterprise context. The main issue is that a Spark cluster in an enterprise 
usually has no Internet connectivity, and users have no direct access to the 
nodes. This makes the libpostal integration difficult to use, because it needs 
to download the model from the Internet.
   
   Based on the libpostal integration pull request 
https://github.com/apache/sedona/pull/2077, I can see that a config 
"spark.sedona.libpostal.dataDir" is accepted. It defaults to a local temporary 
directory, because libpostal can only load from a local filesystem.
   
   I propose the following addition:
   Accept a folder on HDFS or an object store (e.g. S3). For a larger job with 
many nodes, it is much more efficient to load the data from HDFS or an object 
store than from the Internet (which may be unavailable, or the server may be 
down).
   
   Since libpostal expects a local directory, I propose that if someone sets 
spark.sedona.libpostal.dataDir to, for example, "s3a://blabla/libpostal", 
Sedona uses the Hadoop dependency of Spark to list the contents of the dataDir 
(https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), 
e.g. via
   ```
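   val dataDir = new Path("s3a://blabla/libpostal")
   // Resolve the FileSystem from the path, so the s3a:// scheme is honored
   dataDir.getFileSystem(sparkContext.hadoopConfiguration).listFiles(dataDir, true)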
   ...
   ```
   Then copy all the files to a local temporary directory using the FileSystem 
class (if not done already) and point libpostal to it.
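
To make the idea concrete, here is a minimal sketch of the copy step in Scala. It is only an illustration, not Sedona's implementation: the object `LibpostalDataFetcher`, its methods, the cache directory naming, and the `_COPY_COMPLETE` marker file are all hypothetical. Only the Hadoop `FileSystem` calls (`getFileSystem`, `makeQualified`, `listFiles`, `copyToLocalFile`) are existing API.

```scala
import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, not part of Sedona.
object LibpostalDataFetcher {

  /** True if dataDir points to something other than the local filesystem. */
  def isRemote(dataDir: String): Boolean = {
    val scheme = new Path(dataDir).toUri.getScheme
    scheme != null && scheme != "file"
  }

  /** Recursively copies every file below dataDir into localRoot,
    * preserving the relative directory layout. */
  def copyTree(dataDir: String, localRoot: File, conf: Configuration): Unit = {
    val remoteDir = new Path(dataDir)
    // Resolve the FileSystem from the path so that s3a://, hdfs://, ... work
    val fs = remoteDir.getFileSystem(conf)
    val rootPath = fs.makeQualified(remoteDir).toUri.getPath
    val files = fs.listFiles(remoteDir, true) // true = recursive
    while (files.hasNext) {
      val status = files.next()
      val relative =
        status.getPath.toUri.getPath.stripPrefix(rootPath).stripPrefix("/")
      val target = new File(localRoot, relative)
      target.getParentFile.mkdirs()
      // delSrc = false, useRawLocalFileSystem = true (no .crc side files)
      fs.copyToLocalFile(false, status.getPath, new Path(target.toURI), true)
    }
  }

  /** Returns a local directory that libpostal can load from. */
  def resolveLocalDataDir(dataDir: String, conf: Configuration): String = {
    if (!isRemote(dataDir)) {
      new File(new Path(dataDir).toUri.getPath).getAbsolutePath
    } else {
      // Assumption: one cache directory per remote location under java.io.tmpdir
      val localRoot = new File(
        System.getProperty("java.io.tmpdir"),
        "libpostal-" + Integer.toHexString(dataDir.hashCode))
      // The marker file implements the "if not done already" check
      val marker = new File(localRoot, "_COPY_COMPLETE")
      if (!marker.exists()) {
        localRoot.mkdirs()
        copyTree(dataDir, localRoot, conf)
        marker.createNewFile()
      }
      localRoot.getAbsolutePath
    }
  }
}
```

On an executor this could be called as `LibpostalDataFetcher.resolveLocalDataDir(dataDir, hadoopConf)` before libpostal is initialized. The sketch does not handle several executors on the same node copying at the same time; a real implementation would probably need a lock or an atomic rename of a staging directory.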
   
   Additionally, I propose that the Apache Sedona documentation include a 
small shell script showing how to fetch the data from the Internet, so that a 
user can upload it to HDFS or an object store (e.g. S3). Maybe something 
similar to 
https://github.com/openvenues/libpostal/blob/master/src/libpostal_data.in
   
   
   @james-willis 
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

