[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-25348:
----------------------------------
    Description:
It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos.

Microsoft has an implementation at [https://github.com/Azure/mmlspark/tree/master/src/io/binary]. It would be great if we could merge it into the Spark main repo.

cc: [~mhamilton] and [~imatiach]

Proposed API:

Format name: "binary-file"

Schema:
 * content: BinaryType
 * status (following Hadoop FileStatus):
 ** path: StringType
 ** modification_time: TimestampType
 ** length: LongType (size limit 2GB)

Options:
 * pathFilterRegex: only include files whose path matches the regex pattern
 * maxBytesPerPartition: the maximum total file size for each partition, unless the partition contains only one file

We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as convenience aliases.

  was:
It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos.

Microsoft has an implementation at [https://github.com/Azure/mmlspark/tree/master/src/io/binary]. It would be great if we could merge it into the Spark main repo.

cc: [~mhamilton] and [~imatiach]

> Data source for binary files
> ----------------------------
>
>                 Key: SPARK-25348
>                 URL: https://issues.apache.org/jira/browse/SPARK-25348
>             Project: Spark
>          Issue Type: Story
>          Components: ML, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Weichen Xu
>            Priority: Major
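
For discussion, here is a minimal plain-Python sketch of how the two proposed options (pathFilterRegex and maxBytesPerPartition) could interact. It is not Spark code and not the proposed implementation. The `FileStatus` tuple, the `plan_partitions` helper, the greedy in-order packing, and the use of a substring regex search (rather than a full-path match) are all assumptions made only for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class FileStatus(NamedTuple):
    """Hypothetical stand-in for the proposed `status` struct."""
    path: str
    modification_time: int  # assumed here to be epoch milliseconds
    length: int             # file size in bytes


def plan_partitions(
    files: List[FileStatus],
    path_filter_regex: Optional[str] = None,
    max_bytes_per_partition: int = 128 * 1024 * 1024,  # illustrative default
) -> List[List[FileStatus]]:
    """Filter files by path, then pack them greedily into partitions.

    A partition's total size stays within max_bytes_per_partition unless
    the partition contains exactly one file, which may be arbitrarily large.
    """
    if path_filter_regex is not None:
        pattern = re.compile(path_filter_regex)
        # Assumption: a file is kept if the pattern matches anywhere in its path.
        files = [f for f in files if pattern.search(f.path)]

    partitions: List[List[FileStatus]] = []
    current: List[FileStatus] = []
    current_bytes = 0
    for f in files:
        # Close the current partition if adding this file would exceed the cap.
        if current and current_bytes + f.length > max_bytes_per_partition:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += f.length
    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    listing = [
        FileStatus("/data/a.png", 0, 40),
        FileStatus("/data/b.png", 0, 50),
        FileStatus("/data/c.txt", 0, 10),
        FileStatus("/data/d.png", 0, 300),
    ]
    for part in plan_partitions(listing, r"\.png$", 100):
        print([f.path for f in part])
```

In this sketch a file larger than the cap still gets a partition of its own, which is one reading of "unless the partition only contains one file".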