KovvuriSriRamaReddy created SPARK-21515:
-------------------------------------------
Summary: Spark ML Random Forest
Key: SPARK-21515
URL: https://issues.apache.org/jira/browse/SPARK-21515
Project: Spark
Issue Type: Question
Components: Build
Affects Versions: 2.1.1
Reporter: KovvuriSriRamaReddy
We are reading data from a flat file and storing it in a Dataset<Row>.
We have a for loop in which we need to modify the dataset and use it in the
next iteration. [For the first iteration we use the original Dataset.]
We know that a variable of type Dataset<Row> is immutable. In our scenario,
however, inside the for loop we perform some processing (Random Forest, Spark
ML) on this Dataset<Row> and use the updated result in the next iteration.
This continues until all iterations are completed. [The size of the dataset
stays the same; only the values change.]
Approach 1: We store the intermediate result in a new Dataset variable and use
it in the next iteration. What we observed is that the 1st iteration of the for
loop took only 1 sec, while each subsequent iteration took progressively longer
[i.e. the 2nd iteration took 70 sec, the 3rd iteration took 90 sec, and so
on].
Approach 2: We write the intermediate Dataset to HDFS or an external file and
read it back fresh from HDFS/file at the start of each iteration. Each
iteration then completes faster than with the previous approach. However,
writing and reading the data to/from HDFS or an external file takes additional
time.
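
One possible reading of the difference between the two approaches, offered as
an assumption and not as a confirmed diagnosis: in Approach 1 every new Dataset
may still carry the full chain of earlier steps, so later iterations have more
to plan and replay, while the write/read in Approach 2 restarts from
materialized values. The sketch below is a toy model in plain Java, not Spark
code; LazyData, addOne, materialize and LineageDemo are hypothetical names
invented for illustration.

```java
import java.util.function.Supplier;

// Toy stand-in for a lazily evaluated dataset. This is NOT Spark's Dataset API.
final class LazyData {
    private final Supplier<int[]> plan; // recipe that recomputes the values from the source
    final int depth;                    // number of chained steps in the recipe

    private LazyData(Supplier<int[]> plan, int depth) {
        this.plan = plan;
        this.depth = depth;
    }

    // Source data: a plan of depth 0 that returns a copy of the given values.
    static LazyData of(int[] values) {
        int[] copy = values.clone();
        return new LazyData(() -> copy.clone(), 0);
    }

    // Lazy transformation: nothing runs now; the new plan wraps the old one.
    LazyData addOne() {
        Supplier<int[]> parent = plan;
        return new LazyData(() -> {
            int[] in = parent.get();
            int[] out = new int[in.length];
            for (int i = 0; i < in.length; i++) {
                out[i] = in[i] + 1;
            }
            return out;
        }, depth + 1);
    }

    // Action: runs the whole chain of steps and returns the values.
    int[] collect() {
        return plan.get();
    }

    // Compute the values once and restart from them, dropping the history
    // (the role the HDFS write/read plays in Approach 2 under this assumption).
    LazyData materialize() {
        return of(collect());
    }
}

final class LineageDemo {
    // Approach 1 shape: keep chaining; the plan grows by one step per iteration.
    static int depthWithoutMaterialize(int iterations) {
        LazyData d = LazyData.of(new int[] {1, 2, 3});
        for (int i = 0; i < iterations; i++) {
            d = d.addOne();
            d.collect(); // each action replays every earlier step
        }
        return d.depth;
    }

    // Approach 2 shape: materialize after each iteration; the plan stays flat.
    static int depthWithMaterialize(int iterations) {
        LazyData d = LazyData.of(new int[] {1, 2, 3});
        for (int i = 0; i < iterations; i++) {
            d = d.addOne().materialize();
        }
        return d.depth;
    }
}
```

In this model, five iterations leave a chain of depth 5 without materializing
and depth 0 with it, while the final values are the same either way. Spark
itself offers Dataset.checkpoint() (since 2.1), which is documented as
truncating the logical plan; whether that would be cheaper here than the
explicit HDFS round trip is not something this report establishes.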
This is the problem we need to tune. Could anyone please suggest a better
solution for this issue?
Note: We unpersist the previous Datasets and assign null to them at the end of
each loop iteration.
Thanks in advance.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)