[ 
https://issues.apache.org/jira/browse/MAHOUT-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trevor Grant resolved MAHOUT-2101.
----------------------------------
    Resolution: Won't Fix

Jira Cleanup 1/31/24

> Mahout local file distribution
> ------------------------------
>
>                 Key: MAHOUT-2101
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-2101
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Stefan Goldener
>            Priority: Major
>
> At the moment Mahout depends heavily on HDFS. Although MAHOUT_LOCAL makes 
> Mahout use the local file system, it is not possible to combine 
> MAHOUT_LOCAL=true with a Spark-only cluster.
> My suggestion is to improve the Mahout code so that it supports local files 
> and distributes them via Spark. There are multiple options for that, e.g. 
> Spark SQL, DataFrames, Datasets or RDDs.
> This would also allow Mahout to use the new Spark Kubernetes features and 
> hence scale well.
> Probably the best improvement would be for Mahout to use the Spark context 
> and simply read the files via sc.textFile("file:///path to the file/").
> The invocation would then look like the command below. The only remaining 
> problem is how the file is read: the executors cannot find the file because 
> it exists only on the driver.
> {code:sh}
> mahout spark-itemsimilarity -i /tmp/file.txt -o ~/dataout/out.txt \
>   -rd ',' -f1 pur -rc 0 -fc 1 -ic 2 -os -sem 10g -ma k8s://localhost:8080 \
>   -D:spark.dynamicAllocation.enabled=true \
>   -D:spark.shuffle.service.enabled=true \
>   -D:spark.executor.instances=5 \
>   -D:spark.kubernetes.container.image=$CONTAINER_IMAGE \
>   -D:spark.kubernetes.namespace=$NAMESPACE -D:spark.driver.host=$IP
> {code}
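
The driver-only file problem described in the issue can be illustrated with a small sketch. This is not Mahout code and not how spark-itemsimilarity reads its input; it only shows one possible workaround, assuming the file fits in driver memory: read the file on the driver and hand its lines to Spark with parallelize, so the executors never need to open the path themselves. The helper names read_local_lines and distribute_local_file are hypothetical; sc is assumed to be a pyspark.SparkContext.

```python
from pathlib import Path


def read_local_lines(path):
    """Read a driver-local text file into a list of lines (newlines stripped)."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def distribute_local_file(sc, path, num_slices=None):
    """Turn a driver-local file into a distributed collection of lines.

    `sc` is assumed to be a pyspark.SparkContext; only its
    parallelize(collection, numSlices) method is used here.
    """
    # The file is opened on the driver only, so it does not have to
    # exist on any executor.
    lines = read_local_lines(path)
    # Spark ships the in-memory lines to the executors as partitions.
    return sc.parallelize(lines, num_slices)
```

With a real cluster this might be called as distribute_local_file(sc, "/tmp/file.txt"), where sc was created by the application. The trade-off is that the whole file passes through the driver, so this only suits inputs that fit in driver memory; larger inputs would still need storage that every executor can reach.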



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
