Re: My Hadoop Summit Talk: NASA+BigData

Mattmann, Chris A (388J) Wed, 20 Mar 2013 07:27:23 -0700

Hey Bruce,

A couple points:


On 3/20/13 5:46 AM, "Bruce Barkstrom" <[email protected]> wrote:

>That may be a bit better.
>
>However, it still isn't clear to me how the physics of the instruments
>and of the data processing gets into what users understand they
>can do with the data.

Yeah, agreed. At the same time, this is kind of difficult to throw into
a 45 min with 15 mins "techie talk" that I haven't even prepared yet,
and even harder to throw into the 100-word (what you see on the website)
and 200-word (longer, what I sent you) abstracts that they requested.

>
>As I understand Big Data and analytics, it usually appears to be using
>a lot of statistics to find unexpected correlations in the data, but
>the techniques aren't looking for causation.  If you're dealing with
>scientific data, you're usually trying to get to physical causation.
>That means, I think, that users need to understand how the
>physics and math constrain what they can do.

++50 agreed.

>
>Let me see if I can identify a more concrete example of a
>concern.  Usually, when we want to deal with physically
>connected phenomena, we want disparate data to be
>observing the same chunk of space at the same time.
>If the Big Data user picks up one piece of data from region
>X_1 and t_1 and then develops a correlation with observations
>with data from X_2 and t_2, where X_1 /= X_2 and t_1 /= t_2,
>it isn't clear why that correlation has anything to do with
>physical causation.  Or, to put it another way, Big Data
>may just give more examples of the "cherry picking"
>climate deniers do when they select data without
>paying attention to the statistical and physical significance
>of their "results".

Totally agree. A lot of the time, this is the big difference
between card-carrying statisticians and *computer science*
oriented *machine learning* people.
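
As a minimal sketch of the co-location point above (same chunk of space,
same time), one could pair observations only when they fall within a
spatial and a temporal tolerance, and correlate only those pairs. The
Obs type, the colocated_pairs helper, the units, and the tolerances
below are all hypothetical, chosen just for illustration; nothing here
comes from an actual NASA pipeline.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Obs:
    """One observation (hypothetical layout: flat x/y in km, t in seconds)."""
    x: float
    y: float
    t: float
    value: float


def colocated_pairs(a, b, max_dist_km, max_dt_s):
    """Return (p, q) pairs that are close in BOTH space and time."""
    pairs = []
    for p in a:
        for q in b:
            close_in_space = hypot(p.x - q.x, p.y - q.y) <= max_dist_km
            close_in_time = abs(p.t - q.t) <= max_dt_s
            if close_in_space and close_in_time:
                pairs.append((p, q))
    return pairs


# Made-up sample data: only the first pair shares a place and a time.
instrument_a = [Obs(0.0, 0.0, 0.0, 1.0), Obs(100.0, 0.0, 0.0, 2.0)]
instrument_b = [Obs(0.5, 0.0, 30.0, 1.1), Obs(100.0, 0.0, 86400.0, 5.0)]

matched = colocated_pairs(instrument_a, instrument_b,
                          max_dist_km=1.0, max_dt_s=60.0)
```

The second observation in each list sits at the same location but a day
apart, so it is dropped rather than fed into a correlation.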

>
>So, even though the data rates are large by today's
>standards, I'm not sure that, by itself, is impressive.

Well, I have to say it is impressive. Can you show me a disk
that can today write 700 TB of data per second? Or the filesystem
drivers and parallel I/O necessary to support them? Imagine
astronomy, where they are moving into the time domain and
away from the "sky is the archive", "so just reobserve next
time" mentality. Triage, which is super important, is no longer
the main driver; archival is now becoming important, and
necessary, in these eventually 700TB/sec producing systems.

There are all sorts of IO, hardware, computer science, and
other advances that we don't have that are needed, and that
these types of examples like the SKA will drive.
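
As a rough back-of-the-envelope sketch of that scale: the 700 TB/sec
figure is the one quoted in this thread, while the per-disk sustained
write rate below is purely an assumed round number (not a measurement),
and decimal units (1 TB = 1e12 bytes) are assumed as well.

```python
import math

SKA_RATE_BYTES_PER_S = 700e12    # 700 TB/sec, the figure from this thread
DISK_WRITE_BYTES_PER_S = 150e6   # ASSUMPTION: 150 MB/s sustained per disk
SECONDS_PER_DAY = 86_400


def disks_needed(rate_bytes_per_s, per_disk_bytes_per_s):
    """Disks that must write in parallel just to keep up with the stream."""
    return math.ceil(rate_bytes_per_s / per_disk_bytes_per_s)


n_disks = disks_needed(SKA_RATE_BYTES_PER_S, DISK_WRITE_BYTES_PER_S)
daily_exabytes = SKA_RATE_BYTES_PER_S * SECONDS_PER_DAY / 1e18

print(f"{n_disks:,} disks writing in parallel")
print(f"{daily_exabytes:.2f} EB per day if everything were kept")
```

Under those assumptions the answer is in the millions of disks, which is
the point: no single device or stock filesystem gets anywhere close.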

OTOH, the sheer infrastructure, domestic and international policy,
investment, and excitement and sense of nationality that many of
these new Big Data systems (especially the SKA) are creating in
their respective countries (e.g., in South Africa), is enough
to at least suggest to my evidence based mind that there is
something impressive here.

>Maybe the relevant example would be all those statistics
>on dams built or tons of steel produced by the Soviet
>Union.  The hype would be more interesting if it could
>talk about what new phenomena or understanding
>these techniques will produce - not just the data rate
>or the total amount of data being produced.

Agreed, lots of data has been generated for a while. However,
the volume (total and discrete), velocity, and variety (in
data types, metadata, etc.) are certainly such that they are
worthy of current study, at least in the area of data management.

>
>Maybe it's just a glorified popularity contest; if so,
>it would seem to be at about the level of interest
>of the new season of "Dancing with the Stars".

Perhaps, but I know you guys are interested in that show :)
Who's not?

>I suppose the hype is necessary to generate the
>funding (which has its uses), but I'm not sure it
>will do as much as a few million sent to appropriate
>super PACs to move the politics of climate change
>along.

Think of this as an IT super PAC for next generation data management
techniques and systems to deal with data volumes and varieties that
we don't have hardware or CS tools to manage yet. I'm not talking
about writing to tape and letting it die in the morgue. I'm talking about
even simple things like making it available after you write it to spinning
disk.

Cheers,
Chris

>
>Bruce B.
>
>On Wed, Mar 20, 2013 at 1:16 AM, Mattmann, Chris A (388J) <
>[email protected]> wrote:
>
>> Hey Bruce,
>>
>> Hah!
>>
>> Unfortunately all you get is the short summary through
>> the website, which does make it scientifically hard to
>> judge. Then again, this isn't science, it's a
>> glorified popularity contest.
>>
>> I have a little bit more detailed abstract that I wrote up,
>> pasted below (of course the part that they don't use to solicit votes):
>>
>> ---longer abstract
>> The NASA Jet Propulsion Laboratory, California Institute of
>> Technology contributes to many Big Data projects for Earth science such
>> as the U.S. National Climate Assessment (NCA) and for astronomy such as
>> next generation astronomical instruments like the Square Kilometre Array
>> (SKA) that will generate unprecedented volumes of data (700TB/sec!).
>>
>> Through these projects, we are addressing four key challenges critical
>> for the Hadoop community and broader open source Big Data community to
>> consider: (1) unobtrusively integrating science algorithms into large
>> scale processing systems; (2) selecting and deploying high powered data
>> movement technologies for data staging and remote data acquisition,
>> processing, and delivery to our customers and users; (3) better
>> leveraging of cloud computing (storage and processing) technologies in
>> NASA missions; and (4) technologies for automatically and rapidly
>> extracting text and metadata from the file formats, by some estimates
>> ranging from a few thousand to over fifty thousand in total.
>>
>> This talk will focus on those Big Data challenges, how NASA JPL is
>> addressing them both technologically (Hadoop, OODT, Tika, Nutch, Solr)
>> and from a community standpoint (Apache, interacting with open source,
>> etc.). I'll also discuss the future of Big Data at JPL and NASA and how
>> others can get involved.
>> -----
>>
>> You can think of that as the longer version of what I submitted. *grin*
>>
>> Cheers,
>> Chris
>>
>>
>>
>> On 3/19/13 7:20 PM, "Bruce Barkstrom" <[email protected]> wrote:
>>
>> >OK, so you've got a three-word summary of some
>> >hyperbole with Dumbo, the Flying Elephant.
>> >How are you going to deal with the real
>> >scientific constraints on the physics of combining real
>> >measurement technologies and "mashing stuff together"?
>> >
>> >You need to remember that imaging instruments integrate
>> >radiances with spectral responses and Point Spread Function
>> >weighted averages over the FOV of whatever the instrument
>> >was looking at - and that's just the instantaneous (L1 measurement).
>> >If you do orthorectification, you've got variations in the uncertainties
>> >across the image: the parts of the image where you've
>> >increased the resolving power (by putting interpolated points
>> >closer together) have also increased the noise, because the
>> >orthorectification process acts as a noise multiplier.
>> >
>> >Next, you've got stuff like cloud identification (and rejection or
>> >acceptance) - which depends on spectral response, solar illumination
>> >(during the day) and temperature and cloud property stuff during
>> >the night - and finally, you've got temporal interpolation (not just
>> >creating an average through emission driven by solar illumination
>> >during the day and IR cooling at night).  Where (the hel)l is
>> >the physics that deals with this stuff?  If you do get some
>> >statistical stuff, why should anyone believe it contributes to
>> >our understanding of climate change?
>> >
>> >I won't vote, but you can think of this as my input to your
>> >scientific conscience.
>> >
>> >Bruce B.
>> >
>> >On Tue, Mar 19, 2013 at 7:51 PM, Mattmann, Chris A (388J) <
>> >[email protected]> wrote:
>> >
>> >> Hey Guys,
>> >>
>> >> I proposed a talk for NASA and Big Data at the Hadoop Summit:
>> >>
>> >>
>> >>
>> 
>> >> http://hadoopsummit2013.uservoice.com/forums/196822-future-of-apache-hadoop/suggestions/3733470-nasa-science-and-technology-for-big-data-junkies-
>> >>
>> >>
>> >> If you still have votes, and would like to support my talk, I'd
>> >> certainly appreciate it!
>> >>
>> >> Thank you for considering.
>> >>
>> >> Cheers,
>> >> Chris Mattmann
>> >> Vote Herder
>> >>
>> >>
>>
>>
