KovvuriSriRamaReddy created SPARK-21515:
-------------------------------------------

             Summary: Spark ML Random Forest
                 Key: SPARK-21515
                 URL: https://issues.apache.org/jira/browse/SPARK-21515
             Project: Spark
          Issue Type: Question
          Components: Build
    Affects Versions: 2.1.1
            Reporter: KovvuriSriRamaReddy


We are reading data from a flat file and storing it in a Dataset<Row>.
We have a for loop in which we need to modify the dataset and use it in the
next iteration. [ The first iteration uses the original Dataset. ]

We know that a variable of type Dataset<Row> is immutable. In our scenario,
however, we perform some processing (Random Forest, Spark ML) on this variable
(Dataset<Row>) inside the for loop and use the updated result in the next
iteration. This continues until all iterations are complete. [ The size of the
dataset stays the same; only the values change. ]

Approach 1: We store the intermediate result in a new Dataset variable and
use it in the next iteration. We observed that the first iteration of the for
loop took only 1 sec, while each later iteration took progressively longer
[ e.g. the 2nd iteration took 70 sec, the 3rd took 90 sec, and so on ].

Approach 2: We write the intermediate Dataset to HDFS / an external file and
read it back fresh for each iteration. Each iteration then completes faster
than with the previous approach. However, writing the data to and reading it
from HDFS / the external file adds time of its own.
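
To make the difference between the two approaches concrete, here is a minimal
sketch in plain Java. It is a toy model, not Spark code: a "dataset" is
represented only by the list of steps needed to rebuild it, and every name in
it (IterativeLoopSketch, run, randomForestStep, read(intermediateFile)) is
hypothetical. It assumes that "at the end of the loop" in the note below means
the end of each pass. It shows that in Approach 1 each new value carries the
whole history of earlier passes, while in Approach 2 the value read back from
storage starts from a single read step. Whether that growth accounts for the
timings reported above is an assumption, not something established here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

public class IterativeLoopSketch {

    /**
     * Generic loop driver. Each pass derives a new value from the previous
     * one, optionally round-trips it through storage, then releases the
     * previous value (assumed to correspond to unpersisting it).
     */
    public static <T> T run(T original, int iterations,
                            UnaryOperator<T> step,
                            UnaryOperator<T> roundTrip,
                            Consumer<T> release) {
        T current = original;
        for (int i = 0; i < iterations; i++) {
            T next = roundTrip.apply(step.apply(current));
            release.accept(current); // end of pass: drop the previous value
            current = next;
        }
        return current;
    }

    /** Returns a copy of the plan with one more step appended. */
    public static List<String> append(List<String> plan, String op) {
        List<String> copy = new ArrayList<>(plan);
        copy.add(op);
        return copy;
    }

    /** Approach 1: the result is kept in a variable and reused directly. */
    public static int approach1PlanLength(int iterations) {
        List<String> source = Collections.singletonList("read(flatFile)");
        return run(source, iterations,
                plan -> append(plan, "randomForestStep"),
                UnaryOperator.identity(),   // no storage round trip
                plan -> { }).size();
    }

    /** Approach 2: the result is written out and read back fresh each pass. */
    public static int approach2PlanLength(int iterations) {
        List<String> source = Collections.singletonList("read(flatFile)");
        return run(source, iterations,
                plan -> append(plan, "randomForestStep"),
                plan -> Collections.singletonList("read(intermediateFile)"),
                plan -> { }).size();
    }

    public static void main(String[] args) {
        System.out.println(approach1PlanLength(5)); // prints 6
        System.out.println(approach2PlanLength(5)); // prints 1
    }
}
```

In the real job, T would be Dataset<Row>, step would be the Random Forest
stage, roundTrip would be the write to and read from HDFS, and release would
be the call to unpersist() on the previous Dataset.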


This is the problem we need to tune. Could anyone please suggest a better
solution for this issue?

Note: We unpersist the previous Datasets and assign null to them at the end
of the loop.

Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

