[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818640#comment-16818640 ]
Xiangrui Meng edited comment on SPARK-25348 at 4/21/19 7:49 PM:
----------------------------------------------------------------

I created follow-up tasks:
* Documentation: SPARK-27472
* Filter push down: SPARK-27473
* Content column pruning: SPARK-27534

was (Author: mengxr):
I created two follow-up tasks:
* Documentation: SPARK-27472
* Filter push down: SPARK-27473

> Data source for binary files
> ----------------------------
>
>                 Key: SPARK-25348
>                 URL: https://issues.apache.org/jira/browse/SPARK-25348
>             Project: Spark
>          Issue Type: Story
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Weichen Xu
>            Priority: Major
>             Fix For: 3.0.0
>
>
> It would be useful to have a data source implementation for binary files,
> which can be used to build features to load images, audio, and videos.
> Microsoft has an implementation at
> [https://github.com/Azure/mmlspark/tree/master/src/io/binary]. It would be
> great if we could merge it into the Spark main repo.
> cc: [~mhamilton] and [~imatiach]
>
> Proposed API:
> Format name: "binaryFile"
> Schema:
> * content: BinaryType
> * status (following Hadoop FileStatus):
> ** path: StringType
> ** modificationTime: TimestampType
> ** length: LongType (size limit 2GB)
>
> Options:
> * pathGlobFilter: only include files whose path matches the glob pattern
>
> Input partition size can be controlled by common SQL confs: maxPartitionBytes
> and openCostInBytes
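
To make the proposed record shape concrete, here is a minimal plain-Python sketch. It is not Spark code and not the implementation tracked by this issue. It only models one record per file with the fields listed under "Schema" and applies a glob in the spirit of pathGlobFilter. The function name read_binary_files, the MAX_LENGTH constant, and the choice to match the glob against the file name rather than the full path are assumptions made for illustration.

```python
import fnmatch
import os
import tempfile
from datetime import datetime, timezone

# Assumption: "size limit 2GB" is read here as the largest signed 32-bit value.
MAX_LENGTH = 2**31 - 1


def read_binary_files(directory, path_glob_filter="*"):
    """Return one dict per matching file, shaped like the proposed schema."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # Assumption: the glob is matched against the file name only.
        if not fnmatch.fnmatchcase(name, path_glob_filter):
            continue
        stat = os.stat(path)
        if stat.st_size > MAX_LENGTH:
            raise ValueError(f"{path} exceeds the assumed 2GB content limit")
        with open(path, "rb") as f:
            content = f.read()
        records.append({
            "content": content,                      # BinaryType
            "status": {
                "path": path,                        # StringType
                "modificationTime": datetime.fromtimestamp(
                    stat.st_mtime, tz=timezone.utc   # timestamp
                ),
                "length": stat.st_size,              # LongType
            },
        })
    return records


if __name__ == "__main__":
    # Hypothetical demo data: two small files in a temporary directory.
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "a.png"), "wb") as f:
            f.write(b"\x89PNG")
        with open(os.path.join(d, "b.txt"), "wb") as f:
            f.write(b"hello")
        for row in read_binary_files(d, "*.png"):
            print(row["status"]["path"], row["status"]["length"])
```

In Spark itself the proposal would presumably surface through the usual reader API, with "binaryFile" as the format name and pathGlobFilter as a reader option. The exact column layout is whatever the final implementation settles on.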