Hey Srikanth,

Thanks a lot for the reply! This clarifies much of our understanding of partitions. With this in mind, we will try to come up with a proposal to tackle https://issues.apache.org/jira/browse/FALCON-511.
Thanks,
John

2014-07-23 20:36 GMT-07:00 Srikanth Sundarrajan <[email protected]>:

> are the partition keys values (say country=us or country=uk) need to be
> defined before-hand or unbounded?
>
> Yes the partition values themselves are unbounded.
>
> does the storage location need to have the partition key in them
>
> In most cases there are time partitions, besides the time partition, there can be
> other partition, which are declared in the partition section. So the
> partitions ought to be in the path as a variable. It can be skipped if no
> consumer has interest in filtering and selecting a section of the data
> through the dataIn(input, partitionSpec) function.
>
> if the partition keys are not in the FileSystem path, how does Falcon
> identify a feed partition physical location
>
> If partition keys aren't specified, then Falcon can't use it either in the
> file system version of the input. Partitions are only used in two scenarios
> by Falcon.
> 1) When data is partitioned in multiple clusters, they can be
> merged into a single location using replication (single target, multiple
> source). For this to work, each source should own a partition exclusively.
> 2) Data can be selectively consumed by filtering specific partition through
> the dataIn() EL expression
>
> Regards
> Srikanth Sundarrajan
>
> > From: [email protected]
> > Date: Wed, 23 Jul 2014 17:16:34 -0700
> > Subject: Partitions in Feed definition
> > To: [email protected]
> >
> > Hey all,
> >
> > Few questions about Partitions:
> >
> > Partitions in the FEED xml like below:
> >
> > <partitions>
> >     <partition name="colo"/>
> >     <partition name="country"/>
> > </partitions>
> >
> > 1. I see these are partition keys; are the partition keys values
> >    (say country=us or country=uk) need to be defined before-hand or
> >    unbounded?
> > 2. does the storage location need to have the partition key in
> >    them?
> >    Like below (see the colo and country partition keys)
> >
> >    <location path="/data/${colo}/${country}/${YEAR}/${MONTH}/${DAY}"
> >    type="data"/>
> >
> > 3. if the partition keys are not in the FileSystem path, how does
> >    Falcon identify a feed partition physical location (actually,
> >    how/where is it used)? I understand if it were HCAT, the Feed
> >    definition has the partition key-values.
> > 4. Are these partition keys and values validated against the
> >    FileSystem or HCAT locations?
> >
> > Partition attribute in the Cluster reference:
> >
> > Using the example from the documentation page
> > <http://falcon.incubator.apache.org/docs/FalconArchitecture.html#Replication>
> >
> > 1. What does it mean to specify partitions in a source cluster ?
> > 2. vs target cluster? (does it act like a filter to pull only a
> >    subset of data from source? -- if so how does Falcon know to read the
> >    subset in Filesystem feed?)
> > 3. What data is in sourceCluster1, sourceCluster2 and what location?
> > 4. Which path does the replicated data end up in the backupCluster
> >    (target)?
> >
> > A few questions. Hopefully it's something straightforward about
> > partitions that I have missed.
> >
> > Thanks for your answers,
> > John

--
余守中 John Yu (Yu, Shoou-Jong)
Mobile: 650-691-3314
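
To make the point about partition keys appearing in the storage path as variables more concrete, here is a minimal Python sketch. It is not Falcon's implementation and does not use any Falcon API; the helper names (expand_location, template_to_regex, matches_partition) and the sample values are hypothetical. It only shows how a path template like the one in the question could be expanded, and how a concrete path could be checked against a partition filter when the keys are present in the path.

```python
import re

# Location template copied from the feed definition quoted in the thread.
TEMPLATE = "/data/${colo}/${country}/${YEAR}/${MONTH}/${DAY}"

# Matches ${NAME} placeholders in a template.
_VAR = re.compile(r"\$\{(\w+)\}")


def expand_location(template, values):
    """Substitute each ${NAME} with values[NAME]; raises KeyError if one is missing."""
    return _VAR.sub(lambda m: values[m.group(1)], template)


def template_to_regex(template):
    """Build a regex with one named group per ${NAME} placeholder."""
    parts = []
    pos = 0
    for m in _VAR.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal text
        parts.append("(?P<%s>[^/]+)" % m.group(1))        # one path segment
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("^" + "".join(parts) + "$")


def matches_partition(template, path, wanted):
    """True if path fits the template and carries every wanted key=value pair."""
    m = template_to_regex(template).match(path)
    if m is None:
        return False
    found = m.groupdict()
    return all(found.get(key) == value for key, value in wanted.items())
```

For example, expanding the template with colo=colo1, country=us and a date gives a concrete directory, and matches_partition(TEMPLATE, path, {"country": "us"}) can then select it. If the template had no ${country} segment, there would be nothing in the path to filter on, which is one way to read Srikanth's remark that partitions not present in the path cannot be used on the file system side.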
