cchighman commented on a change in pull request #28841: URL: https://github.com/apache/spark/pull/28841#discussion_r441141560
########## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/pathfilters/PathFilterIgnoreOldFiles.scala ##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.pathfilters
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{Path, PathFilter}
+
+import org.apache.spark.sql.SparkSession
+
+/**
+SPARK-31962 - Provide option to load files after a specified
+date when reading from a folder path. When specifying the

Review comment:
   @zsxwing - Author of the surrounding _latestFirst_ option for structured streaming.
   @HyukjinKwon / @HeartSaVioR - From the JIRA comments for the above work item.

   Having the ability to specify a file's modified date as the starting point for loading file data sources from a folder path, or as part of Structured Streaming, has a lot of value for massive legacy data setups trying to move away from SQL and over to Databricks. Auto Loader doesn't support Azure Data Lake Gen 1, and common patterns for large-scale ETL pipelines may aggregate extracted data into a folder structure organized by date.
   This means either very well-established, mature data lakes have to be restructured, or we have to build complexity around starting and stopping streams as the night ends so that we can target a folder with the current date. The other alternative is not using Structured Streaming at all and staying batch-driven, when our key need is the ability to start streaming or loading on or after a particular date. That is to say: from this point forward, I want to consume all files in this path, but nothing prior. This PR only covers the file data source. I would like to follow up with another for Structured Streaming, as FileDataSource and FileStreamSource both share _InMemoryFileIndex_ as their mechanism of action. This is my first contribution to Spark, and I've been very pleased with how well everything is maintained!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
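The core idea behind the new PathFilterIgnoreOldFiles, keeping only files whose modification time falls on or after a threshold, can be sketched in isolation. This hypothetical version uses java.nio rather than Hadoop's FileSystem API so it runs without a cluster; the names ModifiedAfterFilter and threshold are illustrative and not taken from the diff.

```scala
import java.nio.file.{Files, Path}
import java.time.Instant

// Hypothetical standalone analogue of the proposed filter:
// accept a file only if it was modified on or after the threshold.
class ModifiedAfterFilter(threshold: Instant) {
  def accept(path: Path): Boolean = {
    val modified = Files.getLastModifiedTime(path).toInstant
    // isBefore is strict, so a file modified exactly at the threshold passes
    !modified.isBefore(threshold)
  }
}

object ModifiedAfterFilterDemo {
  def main(args: Array[String]): Unit = {
    val tmp = Files.createTempFile("demo", ".txt") // freshly created file
    val filter = new ModifiedAfterFilter(Instant.now().minusSeconds(60))
    println(filter.accept(tmp)) // a file created just now clears a 60s-old threshold
    Files.delete(tmp)
  }
}
```

In the real patch the filter plugs into _InMemoryFileIndex_, which is why both the file data source and FileStreamSource could later share it.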
