On Tue, Mar 27, 2012 at 5:22 PM, Franc Carter <franc.car...@sirca.org.au> wrote:
> On Tue, Mar 27, 2012 at 5:09 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Franc,
>>
>> With the given info, all we can tell is that it is possible, but we
>> can't tell how, as we have no idea how your data/dimensions/etc. are
>> structured. Being a little more specific would help.
>
> Thanks, I'll go into more detail.
>
> We have data for a large number of entities (tens of millions) covering
> 15+ years with fairly fine-grained timestamps (though we could work with
> day granularity).
>
> At the extremes, some queries will need to select a small number of
> entities across all 15 years, and some queries will need most of the
> entities for a small time range.
>
> Our current architecture (which we are reviewing) stores the data in
> 'day files' with a sort order that increases the chance that the data we
> want will be close together. We can then seek inside the files and only
> retrieve/process the parts we need.
>
> I'd like to avoid Hadoop having to read and process all of every file to
> answer queries that don't need all the data.
>
> Is that clearer?

I should also add that we know the entities and the time range we are
interested in at query submission time.

>> It is possible to select and pass the right set of inputs per job, and
>> also to implement record readers that only read what is specifically
>> needed. This all depends on how your files are structured.
>>
>> Taking a wild guess, Apache Hive with its columnar storage (RCFile)
>> format may also be what you are looking for.
>
> Thanks, I'll have a look into that.
>
> cheers
>
>> On Tue, Mar 27, 2012 at 11:32 AM, Franc Carter
>> <franc.car...@sirca.org.au> wrote:
>> > Hi,
>> >
>> > I'm very new to Hadoop and am working through how we may be able to
>> > apply it to our data set.
>> >
>> > One of the things that I am struggling with is understanding whether
>> > it is possible to tell Hadoop that only parts of the input file will
>> > be needed for a specific job. The reason I believe I may need this is
>> > that we have two big dimensions in our data set. Queries may want
>> > only one of these dimensions, and while some unneeded reading is
>> > unavoidable, there are cases where reading the entire data set
>> > presents a very significant overhead.
>> >
>> > Or have I just misunderstood something ;-(
>> >
>> > thanks
>>
>> --
>> Harsh J

--

*Franc Carter* | Systems architect | Sirca Ltd

franc.car...@sirca.org.au | www.sirca.org.au

Tel: +61 2 9236 9118

Level 9, 80 Clarence St, Sydney NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215
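
To make the input-selection suggestion above concrete: since the entities
and the time range are known at query submission time, the job driver can
hand Hadoop only the day files that fall inside that range. The following
is a minimal sketch, assuming one file per calendar day named
/data/daily/YYYY-MM-DD; the path layout, class name and argument handling
are illustrative only, not something described in the thread.

// Sketch: restrict a MapReduce job's input to the day files inside the
// query's time range, so map tasks are never scheduled over the other
// 15+ years of data. Paths and names are assumptions for the example.
import java.time.LocalDate;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class DayRangeDriver {

    // Add only the day files in [start, end] as job inputs.
    static void addDayRange(Job job, LocalDate start, LocalDate end) throws Exception {
        for (LocalDate day = start; !day.isAfter(end); day = day.plusDays(1)) {
            FileInputFormat.addInputPath(job, new Path("/data/daily/" + day));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "day-range-query");
        // The time range is known when the query is submitted, so the
        // driver restricts the inputs here, e.g. args = 2011-01-01 2011-01-31.
        addDayRange(job, LocalDate.parse(args[0]), LocalDate.parse(args[1]));
        // ... set mapper/reducer/output path as usual, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For the "most entities, small time range" queries this alone avoids
reading any of the day files outside the requested range.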
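
The record-reader suggestion can be sketched the same way. The reader
below wraps Hadoop's standard LineRecordReader and drops records whose
entity ID is not in the requested set before they ever reach the map
function. Note it still scans every byte of each selected split (seeking
past unwanted data would need an index alongside the sorted day files),
and the class names, the query.entity.ids property and the tab-separated
record layout are assumptions for the example, not part of Hadoop.

// Sketch: an InputFormat whose RecordReader filters out records for
// entities that the query did not ask for. Assumes plain-text records
// with the entity ID as the first tab-separated field.
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class EntityFilterInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new EntityFilterRecordReader();
    }

    // Wraps the standard line reader and silently skips lines whose
    // entity ID is not in the requested set.
    public static class EntityFilterRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private Set<String> wantedEntities;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
            // The entity IDs are passed through the job configuration;
            // "query.entity.ids" is an illustrative property name.
            wantedEntities = new HashSet<>(Arrays.asList(
                    context.getConfiguration().getStrings("query.entity.ids", new String[0])));
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            while (delegate.nextKeyValue()) {
                String entityId = delegate.getCurrentValue().toString().split("\t", 2)[0];
                if (wantedEntities.contains(entityId)) {
                    return true;
                }
            }
            return false;
        }

        @Override
        public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

        @Override
        public Text getCurrentValue() { return delegate.getCurrentValue(); }

        @Override
        public float getProgress() throws IOException { return delegate.getProgress(); }

        @Override
        public void close() throws IOException { delegate.close(); }
    }
}

A driver would wire it up with job.setInputFormatClass(EntityFilterInputFormat.class)
and job.getConfiguration().setStrings("query.entity.ids", "E12345", "E67890")
(example IDs). For the "few entities, all 15 years" queries this combines
naturally with restricting the input paths as in the first sketch.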