[ 
https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608685#comment-14608685
 ] 

Stephen Carman commented on SPARK-8449:
---------------------------------------

Moving the discussion back here, here is what we've found so far...

Saddle's Scala HDF5 Implementation (Native Libraries needed)
https://github.com/saddle/saddle/tree/master/saddle-hdf5

NetCDF's implementation (not even sure whether it's native or not)
https://github.com/Unidata/thredds

The catch with the Saddle implementation is that its jhdf5 bindings are yet 
another dependency, and they don't include the native libraries in their 
builds; you have to obtain the native libraries yourself and make sure they're 
visible to the JVM (on java.library.path) before the bindings can be used.
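
To make that concrete, here's roughly what a user-side check for the native 
libraries would look like with the Saddle/jhdf5 route (the library name 
"jhdf5" is just my guess at what the bindings expect):

object NativeHdf5Check {
  // The native HDF5 shared library has to be loadable by the JVM (via
  // java.library.path) before the Saddle/jhdf5 bindings can be used.
  // The library name "jhdf5" is an assumption, not a confirmed value.
  def nativeLibAvailable(libName: String = "jhdf5"): Boolean =
    try {
      System.loadLibrary(libName)
      true
    } catch {
      case _: UnsatisfiedLinkError => false
    }

  def main(args: Array[String]): Unit = {
    println(s"java.library.path = ${System.getProperty("java.library.path")}")
    println(s"native HDF5 bindings loadable: ${nativeLibAvailable()}")
  }
}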

From a development standpoint the Saddle route seems easier, but it puts more 
of the cost on the user, who has to supply the native libraries; on the other 
hand it keeps native libraries out of Spark's build entirely. The netCDF code 
is, I think, pure Java, so it should be fine to port into Scala, but at the 
cost of more development time. I also think it will be harder to maintain on 
the Spark side, since we'd essentially be writing our own implementation.
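
For comparison, here's a rough sketch of what a pure-Java read path through 
the netCDF reader might look like. The dataset names "labels" and "features" 
and the row-major layout are purely my assumptions, and this ignores HDFS 
entirely:

import ucar.nc2.NetcdfFile
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object Hdf5ReadSketch {
  // Reads a local HDF5 file through the netCDF-Java reader and turns each
  // row into a LabeledPoint. The dataset names and layout are assumptions.
  def readLabeledPoints(path: String): Seq[LabeledPoint] = {
    val nc = NetcdfFile.open(path)
    try {
      val labels = nc.findVariable("labels").read()     // 1-D array of doubles
      val features = nc.findVariable("features").read() // 2-D array of doubles
      val shape = features.getShape                     // Array(numRows, numCols)
      val (rows, cols) = (shape(0), shape(1))
      (0 until rows).map { i =>
        val values = Array.tabulate(cols)(j => features.getDouble(i * cols + j))
        LabeledPoint(labels.getDouble(i), Vectors.dense(values))
      }
    } finally {
      nc.close()
    }
  }
}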

I think the easier path is to have the user install the HDF5 native libraries 
in order to use the functionality in Spark, but if the broader development 
team thinks a pure implementation is the way to go, I'll get started there.
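
Either way, the user-facing piece would presumably follow the interface 
proposal from the issue description below. Here's a skeleton of roughly where 
I'd start; the bodies are placeholders and the object name is just for 
illustration:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

object HDF5Utils {
  /* path - directory path in any Hadoop-supported file system URI */
  def saveAsHDF5(sc: SparkContext, path: String, data: RDD[LabeledPoint]): Unit = {
    // Placeholder: write each partition out as an HDF5 file under `path`,
    // through either the native bindings or a pure-JVM writer.
    ???
  }

  /* path - file or directory path in any Hadoop-supported file system URI */
  def loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint] = {
    // Placeholder: list the files under `path` and read each into LabeledPoints.
    ???
  }
}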

Do we have any simple explanation of the HDF5 file format and its encoding, so 
that I can use some alternative tools to convert files into a more readable format?

> HDF5 read/write support for Spark MLlib
> ---------------------------------------
>
>                 Key: SPARK-8449
>                 URL: https://issues.apache.org/jira/browse/SPARK-8449
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.4.0
>            Reporter: Alexander Ulanov
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Add support for reading and writing HDF5 file format to/from LabeledPoint. 
> HDFS and local file system have to be supported. Other Spark formats to be 
> discussed. 
> Interface proposal:
> /* path - directory path in any Hadoop-supported file system URI */
> MLUtils.saveAsHDF5(sc: SparkContext, path: String, data: RDD[LabeledPoint]): Unit
> /* path - file or directory path in any Hadoop-supported file system URI */
> MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]


