[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813834#comment-16813834 ]
Xiangrui Meng commented on SPARK-25348:
---------------------------------------

Sampling could be supported later.

> Data source for binary files
> ----------------------------
>
>                 Key: SPARK-25348
>                 URL: https://issues.apache.org/jira/browse/SPARK-25348
>             Project: Spark
>          Issue Type: Story
>          Components: ML, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Weichen Xu
>            Priority: Major
>
> It would be useful to have a data source implementation for binary files,
> which can be used to build features to load images, audio, and videos.
> Microsoft has an implementation at
> [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be
> great if we can merge it into Spark main repo.
> cc: [~mhamilton] and [~imatiach]
>
> Proposed API:
> Format name: "binary-file"
> Schema:
> * content: BinaryType
> * status (following Hadoop FileStatus):
> ** path: StringType
> ** modification_time: Timestamp
> ** length: LongType (size limit 2GB)
> Options:
> * pathFilterRegex: only include files with path matching the regex pattern
> * maxBytesPerPartition: The max total file size for each partition unless the
> partition only contains one file
>
> We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as
> convenience aliases.
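
To make the two proposed options concrete, here is a minimal pure-Python sketch of one way their semantics could be read. It is not Spark code and not the implementation discussed in this issue. The names `FileStatus`, `filter_paths`, and `pack_partitions` are hypothetical. Two things are assumptions: that the regex is matched anywhere in the path (`re.search`), and that files are packed greedily in listing order.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for the proposed `status` struct."""
    path: str
    modification_time: float  # seconds since the epoch
    length: int               # file size in bytes


def filter_paths(files: List[FileStatus], path_filter_regex: str) -> List[FileStatus]:
    """Keep only files whose path matches the pattern (assumed: search, not full match)."""
    pattern = re.compile(path_filter_regex)
    return [f for f in files if pattern.search(f.path)]


def pack_partitions(files: List[FileStatus], max_bytes_per_partition: int) -> List[List[FileStatus]]:
    """Greedily group files so each partition stays within the byte budget.

    A file larger than the budget still gets a partition, but it is the only
    file in it, which mirrors the "unless the partition only contains one
    file" wording of the proposal.
    """
    partitions: List[List[FileStatus]] = []
    current: List[FileStatus] = []
    current_bytes = 0
    for f in files:
        # Close the current partition if adding this file would exceed the budget.
        if current and current_bytes + f.length > max_bytes_per_partition:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += f.length
    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    listing = [
        FileStatus("/data/a.png", 0.0, 40),
        FileStatus("/data/b.png", 0.0, 50),
        FileStatus("/data/c.txt", 0.0, 10),
        FileStatus("/data/d.png", 0.0, 200),
        FileStatus("/data/e.png", 0.0, 30),
    ]
    images = filter_paths(listing, r"\.png$")
    for i, part in enumerate(pack_partitions(images, 100)):
        print(i, [f.path for f in part], sum(f.length for f in part))
```

With a budget of 100 bytes, the 200-byte file ends up alone in its own partition, while the smaller files are grouped until the budget would be exceeded.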