cchighman commented on a change in pull request #28841: URL: https://github.com/apache/spark/pull/28841#discussion_r441141560
########## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/pathfilters/PathFilterIgnoreOldFiles.scala ##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.pathfilters
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{Path, PathFilter}
+
+import org.apache.spark.sql.SparkSession
+
+/**
+SPARK-31962 - Provide option to load files after a specified
+date when reading from a folder path. When specifying the

Review comment:
   @zsxwing - Author of the surrounding _latestFirst_ option for structured streaming.
   @HyukjinKwon / @HeartSaVioR - From the JIRA comments for the above work item.

   Having the ability to specify a file's modified date as the starting point for loading file data sources from a folder path, or as part of Structured Streaming, has a lot of value for massive legacy data setups trying to move away from SQL and over to Databricks. Auto Loader doesn't support Azure Data Lake Gen 1, and common patterns for large-scale ETL pipelines may aggregate extracted data into a folder structure organized by date.
   This means either very well-established, mature data lakes have to be restructured, or we have to build complexity around starting and stopping streams as the night ends so that we can target a folder with the current date. The other alternative is not using Structured Streaming at all and staying batch-driven, when our key need is the ability to start streaming or loading on or after a particular date. That is to say: from this point forward, I want to consume all files in this path, but nothing prior. This PR only covers the file data source. I would like to follow up with another for Structured Streaming, as FileDataSource and FileStreamSource both share _InMemoryFileIndex_ as their mechanism of action. This is my first contribution to Spark, and I've been very pleased with how well everything is maintained!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
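The core idea behind the new PathFilterIgnoreOldFiles, keeping only files whose modification time falls on or after a threshold, can be sketched in isolation. This hypothetical version uses java.nio rather than Hadoop's FileSystem API so it runs without a cluster; the names ModifiedAfterFilter and threshold are illustrative and not taken from the diff.

```scala
import java.nio.file.{Files, Path}
import java.time.Instant

// Hypothetical standalone analogue of the proposed filter:
// accept a file only if it was modified on or after the threshold.
class ModifiedAfterFilter(threshold: Instant) {
  def accept(path: Path): Boolean = {
    val modified = Files.getLastModifiedTime(path).toInstant
    // isBefore is strict, so a file modified exactly at the threshold passes
    !modified.isBefore(threshold)
  }
}

object ModifiedAfterFilterDemo {
  def main(args: Array[String]): Unit = {
    val tmp = Files.createTempFile("demo", ".txt") // freshly created file
    val filter = new ModifiedAfterFilter(Instant.now().minusSeconds(60))
    println(filter.accept(tmp)) // a file created just now clears a 60s-old threshold
    Files.delete(tmp)
  }
}
```

In the real patch the filter plugs into _InMemoryFileIndex_, which is why both the file data source and FileStreamSource could later share it.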
