Hi Franc

Adding on to Harsh's response: if you partition your data appropriately in Hive, you can easily switch full data scans on and off. Partitions and sub-partitions (multi-level partitioning) would help you hit only the required data set. How to partition depends entirely on your use cases, i.e. the queries intended for the data set. If you are also looking at sampling, then you may need to incorporate buckets as well.

Regards
Bejoy KS
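P.S. A rough HiveQL sketch of what I mean. It is purely illustrative -- the table name, columns, partition column and bucket count below are made up, so adjust them to your own data and queries:

-- Partition by day so that queries filtering on the date only touch the
-- matching partitions; bucket by entity id to support sampling.
CREATE TABLE trades (
  entity_id  STRING,
  event_time TIMESTAMP,
  payload    STRING
)
PARTITIONED BY (trade_date STRING)
CLUSTERED BY (entity_id) INTO 64 BUCKETS;

-- This query filters on the partition column, so Hive prunes all other
-- day partitions instead of scanning the whole table.
SELECT entity_id, event_time, payload
FROM trades
WHERE trade_date BETWEEN '2012-01-01' AND '2012-01-31'
  AND entity_id IN ('ENTITY_A', 'ENTITY_B');

-- Buckets let you sample a fraction of the data cheaply, e.g. 1 bucket of 64:
SELECT *
FROM trades TABLESAMPLE (BUCKET 1 OUT OF 64 ON entity_id)
WHERE trade_date = '2012-01-03';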
Sent from handheld, please excuse typos.

-----Original Message-----
From: Franc Carter <franc.car...@sirca.org.au>
Date: Tue, 27 Mar 2012 17:26:49
To: <common-user@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: Re: Parts of a file as input

On Tue, Mar 27, 2012 at 5:22 PM, Franc Carter <franc.car...@sirca.org.au> wrote:

> On Tue, Mar 27, 2012 at 5:09 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Franc,
>>
>> With the given info, all we can tell is that it is possible, but we
>> can't tell how, as we have no idea how your data/dimensions/etc. are
>> structured. Being a little more specific would help.
>
> Thanks, I'll go into more detail.
>
> We have data for a large number of entities (tens of millions) for 15+
> years with fairly fine-grained timestamps (though we could do just day
> granularity).
>
> At the extremes, some queries will need to select a small number of
> entities for all 15 years, and some queries will need most of the
> entities for a small time range.
>
> Our current architecture (which we are reviewing) stores the data in
> 'day files' with a sort that increases the chance that the data we want
> will be close together. We can then seek inside the files and only
> retrieve/process the parts we need.
>
> I'd like to avoid Hadoop having to read and process all of every file
> to answer queries that don't need all the data.
>
> Is that clearer?

I should also add that we know the entities and time range we are
interested in at query submission time.

>> It is possible to select and pass the right set of inputs per job, and
>> to also implement record readers to only read what is needed
>> specifically. This all depends on how your files are structured.
>>
>> Taking a wild guess, Apache Hive with its columnar storage (RCFile)
>> format may also be what you are looking for.
>
> Thanks, I'll have a look into that.
>
> cheers
>
>> On Tue, Mar 27, 2012 at 11:32 AM, Franc Carter
>> <franc.car...@sirca.org.au> wrote:
>> > Hi,
>> >
>> > I'm very new to Hadoop and am working through how we may be able to
>> > apply it to our data set.
>> >
>> > One of the things that I am struggling with is understanding whether
>> > it is possible to tell Hadoop that only parts of the input file will
>> > be needed for a specific job. The reason I believe I may need this
>> > is that we have two big dimensions in our data set. Queries may want
>> > only one of these dimensions, and while some unneeded reading is
>> > unavoidable, there are cases where reading the entire data set
>> > presents a very significant overhead.
>> >
>> > Or have I just misunderstood something ;-(
>> >
>> > thanks
>>
>> --
>> Harsh J

--
*Franc Carter* | Systems architect | Sirca Ltd

franc.car...@sirca.org.au | www.sirca.org.au

Tel: +61 2 9236 9118

Level 9, 80 Clarence St, Sydney NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215