[ 
https://issues.apache.org/jira/browse/SYSTEMML-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852566#comment-15852566
 ] 

Deron Eriksson commented on SYSTEMML-1232:
------------------------------------------

cc [~niketanpansare] [~a1singh]

> Migrate stringDataFrameToVectorDataFrame if needed
> --------------------------------------------------
>
>                 Key: SYSTEMML-1232
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1232
>             Project: SystemML
>          Issue Type: Task
>          Components: APIs
>            Reporter: Deron Eriksson
>
> Restore and migrate RDDConvererUtilsExt.stringDataFrameToVectorDataFrame 
> method to Spark 2 (mllib->ml Vector) if needed.
> The RDDConvererUtilsExt.stringDataFrameToVectorDataFrame method was removed 
> by commit 
> https://github.com/apache/incubator-systemml/commit/578e595fdc506fb8a0c0b18c312fe420a406276d.
>   If this method is needed, migrate it to Spark 2.
> Old method:
> {code}
>       public static Dataset<Row> stringDataFrameToVectorDataFrame(SQLContext 
> sqlContext, Dataset<Row> inputDF)
>                       throws DMLRuntimeException {
>               StructField[] oldSchema = inputDF.schema().fields();
>               //create the new schema
>               StructField[] newSchema = new StructField[oldSchema.length];
>               for(int i = 0; i < oldSchema.length; i++) {
>                       String colName = oldSchema[i].name();
>                       newSchema[i] = DataTypes.createStructField(colName, new 
> VectorUDT(), true);
>               }
>               //converter
>               class StringToVector implements Function<Tuple2<Row, Long>, 
> Row> {
>                       private static final long serialVersionUID = 
> -4733816995375745659L;
>                       @Override
>                       public Row call(Tuple2<Row, Long> arg0) throws 
> Exception {
>                               Row oldRow = arg0._1;
>                               int oldNumCols = oldRow.length();
>                               if (oldNumCols > 1) {
>                                       throw new DMLRuntimeException("The row 
> must have at most one column");
>                               }
>                               // parse the various strings. i.e
>                               // ((1.2,4.3, 3.4))  or (1.2, 3.4, 2.2) or (1.2 
> 3.4)
>                               // [[1.2,34.3, 1.2, 1.2]] or [1.2, 3.4] or [1.3 
> 1.2]
>                               Object [] fields = new Object[oldNumCols];
>                               ArrayList<Object> fieldsArr = new 
> ArrayList<Object>();
>                               for (int i = 0; i < oldRow.length(); i++) {
>                                       Object ci=oldRow.get(i);
>                                       if (ci instanceof String) {
>                                               String cis = (String)ci;
>                                               StringBuffer sb = new 
> StringBuffer(cis.trim());
>                                               for (int nid=0; i < 2; i++) { 
> //remove two level nesting
>                                                       if ((sb.charAt(0) == 
> '(' && sb.charAt(sb.length() - 1) == ')') ||
>                                                                       
> (sb.charAt(0) == '[' && sb.charAt(sb.length() - 1) == ']')
>                                                                       ) {
>                                                               
> sb.deleteCharAt(0);
>                                                               
> sb.setLength(sb.length() - 1);
>                                                       }
>                                               }
>                                               //have the replace code
>                                               String ncis = "[" + 
> sb.toString().replaceAll(" *, *", ",") + "]";
>                                               Vector v = Vectors.parse(ncis);
>                                               fieldsArr.add(v);
>                                       } else {
>                                               throw new 
> DMLRuntimeException("Only String is supported");
>                                       }
>                               }
>                               Row row = 
> RowFactory.create(fieldsArr.toArray());
>                               return row;
>                       }
>               }
>               //output DF
>               JavaRDD<Row> newRows = 
> inputDF.rdd().toJavaRDD().zipWithIndex().map(new StringToVector());
>               // DataFrame outDF = sqlContext.createDataFrame(newRows, new 
> StructType(newSchema)); //TODO investigate why it doesn't work
>               Dataset<Row> outDF = sqlContext.createDataFrame(newRows.rdd(),
>                               DataTypes.createStructType(newSchema));
>               return outDF;
>       }
> {code}
> Note: the org.apache.spark.ml.linalg.Vectors.parse() method does not exist in 
> Spark 2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to