[ https://issues.apache.org/jira/browse/MAHOUT-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Trevor Grant resolved MAHOUT-2101.
----------------------------------
Resolution: Won't Fix
Jira Cleanup 1/31/24
> Mahout local file distribution
> ------------------------------
>
> Key: MAHOUT-2101
> URL: https://issues.apache.org/jira/browse/MAHOUT-2101
> Project: Mahout
> Issue Type: Improvement
> Reporter: Stefan Goldener
> Priority: Major
>
> At the moment Mahout relies heavily on HDFS. Although MAHOUT_LOCAL makes it
> use the local file system, it is not possible to combine MAHOUT_LOCAL=true
> with a Spark-only cluster.
> My suggestion is to improve the Mahout code to support local files and
> distribute them via Spark. There are multiple options for that, e.g. Spark
> SQL, DataFrames, Datasets or RDDs.
> This would also allow Mahout to use the new Spark Kubernetes features and
> hence be highly scalable.
> Probably the best improvement would be for Mahout to use the Spark context
> and simply read the files via sc.textFile("file:///path to the file/").
> The invocation would then look like the one below. The only remaining problem
> is how the file is read: the executors cannot find the file because it
> exists only on the driver.
> {code:sh}
> mahout spark-itemsimilarity -i /tmp/file.txt -o ~/dataout/out.txt -rd ',' \
>   -f1 pur -rc 0 -fc 1 -ic 2 -os -sem 10g -ma k8s://localhost:8080 \
>   -D:spark.dynamicAllocation.enabled=true -D:spark.shuffle.service.enabled=true \
>   -D:spark.executor.instances=5 \
>   -D:spark.kubernetes.container.image=$CONTAINER_IMAGE \
>   -D:spark.kubernetes.namespace=$NAMESPACE -D:spark.driver.host=$IP
> {code}
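
The driver-only file problem described above can be illustrated with a small sketch. With sc.textFile("file:///...") each executor opens the path itself, so the file has to exist at the same path on every worker. One possible workaround, shown here in Python rather than in Mahout's own Scala code, is to read the file on the driver and hand the lines to Spark with parallelize(). The helper names read_on_driver and distribute_lines are hypothetical and not part of Mahout; sc is assumed to be a live pyspark SparkContext created elsewhere.

```python
from pathlib import Path


def read_on_driver(path):
    """Read a driver-local text file into a list of lines (no Spark involved)."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def distribute_lines(sc, lines, num_slices=4):
    """Ship driver-side lines to the executors as an RDD.

    Assumption: `sc` is a pyspark.SparkContext. parallelize() serializes the
    collection from the driver, so the executors never open the original path.
    """
    return sc.parallelize(lines, num_slices)
```

With a running context this might be used as distribute_lines(sc, read_on_driver("/tmp/file.txt")). The whole file passes through driver memory, so this only suits inputs that fit there; larger inputs would still need storage that every executor can reach.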
--
This message was sent by Atlassian Jira
(v8.20.10#820010)