A good starting point would be a custom Hadoop InputFormat that reads the
file and creates an `RDD[Row]`. Since you know the record size, it should
be pretty easy to make the InputFormat produce splits, so you could then
read the file in parallel.
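To make that concrete, here is a minimal sketch of the per-record decoding
step, assuming the example layout below (Col1: 64-bit int, Col2: 32-bit
string index, Col3: 64-bit int, Col4: 64-bit float, 28 bytes per row) and
little-endian byte order — the field order, endianness, and the Col2 catalog
are assumptions taken from the example, so adjust them to the real format:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Decoder for one fixed-size row of the custom binary format.
// Inside an InputFormat's RecordReader this would be called once per record.
public class RecordCodec {
    // Col1: 64-bit int, Col2: 32-bit string index, Col3: 64-bit int, Col4: 64-bit float
    public static final int RECORD_SIZE = 8 + 4 + 8 + 8; // 28 bytes per row

    // Hypothetical catalog for Col2, taken from the example below.
    static final String[] COL2_CATALOG = {"Big", "Small", "Large", "Tall"};

    public static final class Record {
        public final long col1;
        public final String col2;
        public final long col3;
        public final double col4;

        Record(long c1, String c2, long c3, double c4) {
            col1 = c1; col2 = c2; col3 = c3; col4 = c4;
        }
    }

    // Decode one fixed-size row, resolving the Col2 index through its catalog.
    public static Record parseRecord(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        long c1 = buf.getLong();
        String c2 = COL2_CATALOG[buf.getInt()];
        long c3 = buf.getLong();
        double c4 = buf.getDouble();
        return new Record(c1, c2, c3, c4);
    }
}
```

Since every row has the same size, you may not even need to write the
InputFormat yourself: Hadoop ships
`org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat`. Set the
record length with `FixedLengthInputFormat.setRecordLength(conf, 28)`, read
the file with `sc.newAPIHadoopFile(...)`, then map each `BytesWritable`
value through a decoder like the one above and build a DataFrame from the
resulting RDD.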

On Mon, Jun 12, 2017 at 6:01 AM, OBones <obo...@free.fr> wrote:

> Hello,
>
> I have an application here that generates data files in a custom binary
> format that provides the following information:
>
> Column list, where each column has a data type (64-bit integer, 32-bit
> string index, 64-bit IEEE float, 1-byte boolean)
> Catalogs that give the modalities for some columns (i.e., column 1 contains
> only the following values: A, B, C, D)
> Array of actual data, where each row has a fixed size according to the columns.
>
> Here is an example:
>
> Col1, 64bit integer
> Col2, 32bit string index
> Col3, 64bit integer
> Col4, 64bit float
>
> Catalog for Col1 = 10, 20, 30, 40, 50
> Catalog for Col2 = Big, Small, Large, Tall
> Catalog for Col3 = 101, 102, 103, 500, 5000
> Catalog for Col4 = (no catalog)
>
> Data array =
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> 8 bytes, 4 bytes, 8 bytes, 8 bytes,
> ...
>
> I would like to use this kind of file as a source for various ML related
> computations (CART, Random Forest, gradient boosting...) and Spark is very
> interesting in this area.
> However, I'm a bit lost as to what I should write to have Spark use that
> file format as a source for its computation. Considering that those files
> are quite big (100 million lines, hundreds of gigs on disk), I'd rather not
> create something that writes a new file in a built-in format, but I'd
> rather write some code that makes Spark accept the file as it is.
>
> I looked around and saw the textFile method, but it is not applicable to my
> case. I also saw the spark.read.format("libsvm") syntax, which tells me that
> there is a list of supported formats known to Spark that produce what I
> believe are called DataFrames, but I could not find any tutorial on this
> subject.
>
> Would you have any suggestion or links to documentation that would get me
> started?
>
> Regards,
> Olivier
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>