GitHub user roshannaik opened a pull request:
https://github.com/apache/storm/pull/936
HDFS Spout for initial review (no Trident support).
Functionally complete, but not yet well tested; some unit tests are included.
**Features:**
- Monitors a directory and picks up new files as they appear.
- Built-in support for consuming Text and Sequence File formats
- Extensible to other file formats.
- Works with or without ACKers.
- Concurrent consumption: Each spout instance consumes a different file in
parallel.
- Zookeeper-free synchronization: an HDFS-based synchronization mechanism
enables each spout to concurrently decide which of the available files to
consume next. A spout acquires a lock on a file before consuming it. The lock
files are stored in a .lock subdirectory, and a lock file is deleted once its
file is consumed.
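The lock-acquisition idea can be sketched as follows. This is a hypothetical illustration, not the PR's actual API: it uses `java.nio.file` on the local filesystem to stand in for HDFS, relying on the same atomic create-if-absent semantics the spout gets from HDFS file creation.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch: each spout tries to atomically create
// <srcDir>/.lock/<fileName>; whichever spout succeeds owns the data file.
// createFile() throws if the lock already exists, mirroring HDFS's
// atomic create-if-absent behavior.
public class DirLockSketch {
    public static boolean tryLock(Path srcDir, String fileName) throws IOException {
        Path lockDir = srcDir.resolve(".lock");
        Files.createDirectories(lockDir);
        try {
            Files.createFile(lockDir.resolve(fileName)); // atomic create-if-absent
            return true;   // this spout now owns the file
        } catch (FileAlreadyExistsException e) {
            return false;  // another spout holds the lock
        }
    }

    public static void releaseLock(Path srcDir, String fileName) throws IOException {
        // the lock file is deleted once the data file is fully consumed
        Files.deleteIfExists(srcDir.resolve(".lock").resolve(fileName));
    }
}
```

Because the create is atomic, two spouts racing for the same file cannot both win the lock, which is what makes Zookeeper unnecessary here.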
- Archiving: Completed files are moved to a (configurable) 'done' folder
- Easy progress tracking: files being consumed are renamed with an
'.inprogress' suffix, allowing users to check which files are being
processed by simply listing the directory. For more detailed info, users can
rely on dashboard metrics.
- Read model: buffered reads and lazy tuple parsing. Reads are buffered, but
parsing out a tuple is deferred until the tuple is needed. This avoids the
periodic latency spikes involved in 'eager' parsing (e.g. the Kafka spout) of
an entire batch of records.
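A minimal sketch of that read model, assuming a tab-delimited text format (the class and delimiter are illustrative, not the PR's actual reader):

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: the underlying stream is buffered, but a record is
// only parsed into fields when the spout actually needs the next tuple,
// rather than parsing a whole batch of records up front.
public class LazyTextReader implements Closeable {
    private final BufferedReader in;  // buffered raw reads

    public LazyTextReader(Reader src) {
        this.in = new BufferedReader(src, 64 * 1024);
    }

    /** Parse exactly one record, on demand; returns null at end of input. */
    public List<String> nextTuple() throws IOException {
        String line = in.readLine();             // cheap: served from the buffer
        if (line == null) return null;
        return Arrays.asList(line.split("\t"));  // parsing deferred to here
    }

    @Override public void close() throws IOException { in.close(); }
}
```

Spreading the parse cost across individual `nextTuple()` calls is what smooths out the latency spikes an eager batch parse would produce.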
- A rolling window model is used to track outstanding unACKed tuples (instead
of the tumbling window model used in the Kafka spout). The tumbling window
model prevents new tuple generation even if a single tuple in the batch is not
ACKed.
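The difference between the two models can be sketched with a small tracker (hypothetical names; not the PR's classes):

```java
import java.util.*;

// Hypothetical sketch of rolling-window ACK tracking: the spout keeps
// emitting new tuples as individual ACKs arrive, up to a configurable cap
// on outstanding unACKed tuples. A tumbling-window tracker would instead
// wait for the entire current batch to be ACKed before emitting anything new.
public class RollingAckTracker {
    private final int maxOutstanding;                 // configurable cap
    private final Set<Long> outstanding = new HashSet<>();

    public RollingAckTracker(int maxOutstanding) {
        this.maxOutstanding = maxOutstanding;
    }

    /** May the spout emit another tuple right now? */
    public boolean canEmit() { return outstanding.size() < maxOutstanding; }

    public void emitted(long msgId) { outstanding.add(msgId); }

    public void acked(long msgId) { outstanding.remove(msgId); }  // frees a slot immediately
}
```

With this model a single slow or lost ACK only occupies one slot in the window; it does not stall emission of everything else.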
- Support for clusters where NTP is not used to synchronize machine clocks
(for identifying timed out locks - see *spout failure detection* section
below).
- Kerberos support
**Fault Tolerance:**
- Un-parseable files - moved to a separate (configurable) directory
- I/O errors - reads are retried later if I/O errors are currently being
encountered
- Duplicate Delivery mitigation
    + The spout tracks its progress on HDFS, so if it dies, another spout can
resume from where it left off rather than reprocessing the whole file.
    + A configurable limit on outstanding unACKed tuples helps contain the
number of duplicate emits when a spout takes over the job of a failed spout.
- Spout Failure and Failover :
+ Spout failure detection - Each spout acquires a lock on the file it is
reading and periodically heartbeats (appends) its progress into the lock file
with a timestamp. This helps identify stale/abandoned locks easily (based on
configurable lock timeout).
+ Failover to another spout - Spouts periodically check for stale locks.
A spout can take ownership of a stale lock and resume reading the file
corresponding to the lock. The latest progress noted in the lock file is used
to decide the location from which to resume.
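The heartbeat and failover mechanism described above can be sketched as follows. The lock-file format (`<millis>,<offset>` lines) and class names are assumptions for illustration, using the local filesystem in place of HDFS. Note how staleness is judged by watching whether the lock file's contents change over an interval measured on the *observer's* clock, which is why machine clocks need not be NTP-synchronized:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the owning spout periodically appends
// "<millis>,<offset>" to its lock file; another spout deems the lock stale
// if the file stops changing for longer than the lock timeout, then resumes
// reading from the last recorded offset.
public class LockHeartbeatSketch {
    /** Append a heartbeat recording the current read offset. */
    public static void heartbeat(Path lockFile, long offset) throws IOException {
        String entry = System.currentTimeMillis() + "," + offset + "\n";
        Files.write(lockFile, entry.getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Offset recorded by the most recent heartbeat; 0 if none yet. */
    public static long lastOffset(Path lockFile) throws IOException {
        List<String> lines = Files.readAllLines(lockFile);
        if (lines.isEmpty()) return 0;
        String[] parts = lines.get(lines.size() - 1).split(",");
        return Long.parseLong(parts[1]);
    }

    /**
     * Stale if the lock file's contents are unchanged since an earlier
     * snapshot taken by the observing spout (timed on its own clock).
     */
    public static boolean isStale(byte[] earlierSnapshot, Path lockFile) throws IOException {
        return Arrays.equals(earlierSnapshot, Files.readAllBytes(lockFile));
    }
}
```

A spout that finds a stale lock this way can take ownership and call something like `lastOffset()` to resume the file from the failed spout's last recorded position.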
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/roshannaik/storm STORM-1199
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/936.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #936
----
commit 03260cc66a076fc118197f2d7ec8c64bd704d936
Author: Roshan Naik <[email protected]>
Date: 2015-12-09T21:10:32Z
Functionally complete. Not well tested. Have some UTs
----