I have been scouring the Apache Software Foundation's project pages trying to solve a rather complex problem I have, and I landed on the "File Processing Patterns" page. I got excited when I read "You can continuously read files or trigger stream and processing pipelines when a file arrives."
I got dismayed when I read "It's best suited for low-frequency, large, file-size updates." My problem is the exact opposite: I have very high-frequency, small-file-size updates. Could you recommend a solution for me? My problem statement is below:

===============================================================================================

We've found that Apache Kafka and Azure Event Hubs are not capturing all the events in our source databases. I won't bore you with the technical details, but it's not a problem that can easily be fixed. We have neither the time nor the talent to move databases around and change the configuration of 8,500 databases, and re-architecting and re-deploying them would be a huge lift as well. Enabling CDC would also have a big impact on the apps' performance. The stream does capture most of the events, but it misses some when bulk update operations are performed on tables.

To pick up the data that is missed here and there, we run daily snapshots. There are ~8,500 copies of the database, each with about 825 tables, and every table gets dumped to Apache Parquet. The whole process takes a few hours to run, and Azure Blob Storage is the lucky recipient of about 7 million files every day. The data is very skewed: some files are very small (<5 KB), while others can be up to 7 GB.

The moment a file lands in Blob Storage, an Azure Function fires and compares the MD5 checksum of the landed file to the previous day's file. If it's the same, or if the file is empty (an empty table), we skip processing it. But if the MD5 checksum is different, we want to compare the contents and figure out how the table changed.

We thought that a Python Pandas operation running in an Azure Function, comparing one day's file to the previous day's, could work. But memory limits and the 5-minute cap per function execution make this look impractical.

From there we thought of configuring the function to copy the differing files into an Azure Blob Storage container that Snowflake watches (Snowpipe) and processes. We also thought about using DataFrames in Databricks. Neither of these tools seems to handle small files well, and we have significant concerns about their cost. They seem great at running a single query against big volumes of data, but they don't seem great at handling oodles of small files for different tables at once, horizontally. We thought Azure Functions would do better with horizontal operations, but we run into memory and processing-time limits.

For the last week or two I have been racking my brain for a solution to this challenge that makes sense, and frankly I just don't have one. I thought I would pick your brain and see if you could offer any advice. It seems to be a hybrid of a "many small files" problem and the AWS Lambda approach described at https://www.confessionsofadataguy.com/why-data-engineers-should-use-aws-lambda-functions/
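To make the Pandas idea concrete, here is roughly the kind of per-table comparison we had in mind. This is just a minimal sketch: diff_snapshots, key_columns, and the local file paths are placeholders for illustration, since the real files live in Blob Storage (where we would probably read the blobs' stored MD5 rather than re-hashing) and every table has its own keys.

import hashlib
from pathlib import Path
from typing import List, Optional

import pandas as pd  # read_parquet also needs pyarrow (or fastparquet) installed


def md5_of(path: Path) -> str:
    # Hash the file in chunks so memory stays flat even for the 7 GB files.
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def diff_snapshots(
    today: Path, yesterday: Path, key_columns: List[str]
) -> Optional[pd.DataFrame]:
    # Mirror the Azure Function's skip logic: identical or empty files are ignored.
    if md5_of(today) == md5_of(yesterday):
        return None

    df_today = pd.read_parquet(today)
    if df_today.empty:
        return None

    df_prev = pd.read_parquet(yesterday)

    # Left-merge on the table's key columns; the "_merge" indicator tells us
    # which rows exist only in today's snapshot.
    merged = df_today.merge(
        df_prev, on=key_columns, how="left", suffixes=("", "_prev"), indicator=True
    )
    added = merged["_merge"] == "left_only"

    # A row counts as changed if any non-key column differs from yesterday's value.
    # (NaN vs. NaN compares as "different" here, which would need handling per table.)
    changed = pd.Series(False, index=merged.index)
    for col in (c for c in df_today.columns if c not in key_columns):
        changed |= merged[col].ne(merged[f"{col}_prev"])

    return merged.loc[added | changed, df_today.columns]

Even this simple version makes the concern obvious: pd.read_parquet pulls both days' files fully into memory, so the larger tables blow well past what a single Azure Function execution gives us in memory and time.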
