I have been scouring the Apache Software Foundation's project pages trying to solve a rather complex problem I have, and I landed on the "File Processing Patterns" page. I got excited when I read "You can continuously read files or trigger stream and processing pipelines when a file arrives."
I got dismayed when I read "It's best suited for low-frequency, large, file-size updates." My problem is the exact opposite: I have very high-frequency, small-file-size updates. Could you recommend a solution for me? My problem statement is below:

===============================================================================================

We've found that Apache Kafka and Azure Event Hubs are not capturing all the events in our source databases. I won't bore you with the technical details, but it's not a problem that can easily be fixed. We have neither the time nor the talent to move databases around and change the configuration of 8,500 databases, and re-architecting and re-deploying them would be a huge lift as well. Enabling CDC would also have a big impact on the apps' performance. The stream does capture most of the events, but it misses some when bulk update operations are performed on tables.

To pick up the data that is missed here and there, we run daily snapshots. There are ~8,500 copies of the database, each with about 825 tables, and every table gets dumped to Apache Parquet. The whole process takes a few hours to run, and Azure Blob Storage is the lucky recipient of about 7 million files every day. The data is very skewed: some files are very small (<5 KB), while others can be up to 7 GB.

The moment a file lands in Blob Storage, an Azure Function fires and compares the MD5 checksum of the landed file to the previous day's file. If it's the same, or if the file is empty (an empty table), we skip processing it. But if the MD5 checksum is different, we want to compare the contents and figure out how the table changed.

We thought that a Python Pandas operation running in an Azure Function, comparing one day's file to the previous day's, could work. But memory limits and the 5-minute cap per function execution make this look impractical.

From there we thought of configuring the function to copy the differing files into an Azure Blob Storage container that Snowflake watches (Snowpipe) and processes. We also thought about using DataFrames in Databricks. Neither of these tools seems to handle small files well, and we have significant concerns about their cost. They seem great at running a single query against big volumes of data, but they don't seem great at handling oodles of small files for different tables at once, horizontally. We thought Azure Functions would do better with horizontal operations, but we run into memory and processing-time limits.

For the last week or two I have been racking my brain for a solution to this challenge that makes sense, and frankly I just don't have one. I thought I would pick your brain and see if you could offer any advice. It seems to be a hybrid of a "many small files" problem and the AWS Lambda approach described at https://www.confessionsofadataguy.com/why-data-engineers-should-use-aws-lambda-functions/
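To make the Pandas idea concrete, here is roughly the kind of per-table comparison we had in mind. This is just a minimal sketch: diff_snapshots, key_columns, and the local file paths are placeholders for illustration, since the real files live in Blob Storage (where we would probably read the blobs' stored MD5 rather than re-hashing) and every table has its own keys.

import hashlib
from pathlib import Path
from typing import List, Optional

import pandas as pd  # read_parquet also needs pyarrow (or fastparquet) installed


def md5_of(path: Path) -> str:
    # Hash the file in chunks so memory stays flat even for the 7 GB files.
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def diff_snapshots(
    today: Path, yesterday: Path, key_columns: List[str]
) -> Optional[pd.DataFrame]:
    # Mirror the Azure Function's skip logic: identical or empty files are ignored.
    if md5_of(today) == md5_of(yesterday):
        return None

    df_today = pd.read_parquet(today)
    if df_today.empty:
        return None

    df_prev = pd.read_parquet(yesterday)

    # Left-merge on the table's key columns; the "_merge" indicator tells us
    # which rows exist only in today's snapshot.
    merged = df_today.merge(
        df_prev, on=key_columns, how="left", suffixes=("", "_prev"), indicator=True
    )
    added = merged["_merge"] == "left_only"

    # A row counts as changed if any non-key column differs from yesterday's value.
    # (NaN vs. NaN compares as "different" here, which would need handling per table.)
    changed = pd.Series(False, index=merged.index)
    for col in (c for c in df_today.columns if c not in key_columns):
        changed |= merged[col].ne(merged[f"{col}_prev"])

    return merged.loc[added | changed, df_today.columns]

Even this simple version makes the concern obvious: pd.read_parquet pulls both days' files fully into memory, so the larger tables blow well past what a single Azure Function execution gives us in memory and time.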
