[
https://issues.apache.org/jira/browse/MAHOUT-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Goldener updated MAHOUT-2101:
------------------------------------
Description:
At the moment Mahout is heavily based on HDFS. Although MAHOUT_LOCAL uses the
local file system, it is not possible to combine MAHOUT_LOCAL=true with a
Spark-only cluster.
My suggestion is to improve the Mahout code to support local files and
distribute them via Spark. There are multiple options for that, e.g. Spark SQL,
DataFrames, Datasets or RDDs.
This would also allow Mahout to use the new Spark-on-Kubernetes features and
hence be highly scalable.
Probably the best improvement would be for Mahout to use the Spark context and
simply read the files via sc.textFile("file:///path to the file/").
The invocation would then look like the example below. The only remaining
problem is how the file is read: the executors cannot find it because it only
exists on the driver.
{code:sh}
mahout spark-itemsimilarity -i /tmp/file.txt -o ~/dataout/out.txt -rd ',' \
  -f1 pur -rc 0 -fc 1 -ic 2 -os -sem 10g -ma k8s://localhost:8080 \
  -D:spark.dynamicAllocation.enabled=true \
  -D:spark.shuffle.service.enabled=true \
  -D:spark.executor.instances=5 \
  -D:spark.kubernetes.container.image=$CONTAINER_IMAGE \
  -D:spark.kubernetes.namespace=$NAMESPACE \
  -D:spark.driver.host=$IP
{code}
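As one possible direction, here is a minimal Scala sketch against the plain
Spark API (not existing Mahout code; the path and the column layout are
placeholders borrowed from the example above) of how the driver-only file could
be distributed: read it on the driver and hand the lines to the executors with
sc.parallelize. For large inputs this pushes all data through the driver, so it
only illustrates the idea.
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object LocalFileDistributionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("local-file-distribution-sketch")
    // The master (e.g. k8s://...) and the -D:spark.* settings from the example
    // above would be supplied via spark-submit / the Mahout driver script.
    val sc = new SparkContext(conf)

    // The input exists only on the driver, so sc.textFile("file:///tmp/file.txt")
    // would fail on the executors. Read it on the driver instead ...
    val source = Source.fromFile("/tmp/file.txt")
    val lines = try source.getLines().toList finally source.close()

    // ... and let Spark ship the data to the executors as an RDD.
    val rdd = sc.parallelize(lines)

    // From here the rows could be parsed and fed into the item-similarity job.
    val rows = rdd.map(_.split(","))
    println(s"Distributed ${rows.count()} rows")

    sc.stop()
  }
}
{code}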
was:
At the moment Mahout is heavily based on HDFS. Although MAHOUT_LOCAL uses the
local file system, it is not possible to combine MAHOUT_LOCAL=true with a
Spark-only cluster.
My suggestion is to improve the Mahout code to support local files and
distribute them via Spark. There are multiple options for that, e.g. Spark SQL,
DataFrames, Datasets or RDDs.
This would also allow Mahout to use the new Spark-on-Kubernetes features and
hence be highly scalable.
Probably the best improvement would be for Mahout to use the Spark context and
simply read the files via sc.textFile("file:///path to the file/").
> Mahout local file distribution
> ------------------------------
>
> Key: MAHOUT-2101
> URL: https://issues.apache.org/jira/browse/MAHOUT-2101
> Project: Mahout
> Issue Type: Improvement
> Reporter: Stefan Goldener
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)