Re: [CODE4LIB] hands-on workshop on natural language processing & text mining [big data]

Eric Lease Morgan Thu, 16 Nov 2017 13:48:48 -0800

On Nov 16, 2017, at 11:20 AM, Chris Gray <[email protected]> wrote:


>> I’m thinking about a hands-on workshop on natural language processing & text 
>> mining, below, and your feedback is desired.  —ELM
> 
> You might be interested in something I ran across recently.  Aditya 
> Parameswaran (http://data-people.cs.illinois.edu/) gave a talk at our campus 
> recently about the efforts of a group he participates in that is aimed at 
> "simplifying and improving data analytics, i.e., helping users make better 
> use of their data".  He wrote a recent blog post for O'Reilly on "Enabling 
> Data Science for the Majority" 
> (https://www.oreilly.com/ideas/enabling-data-science-for-the-majority), which 
> was the topic of the talk I heard.
> 
> He introduced 3 of the 6 projects his team has been working on: DataSpread, 
> Zenvisage, and OrpheusDB all aimed at what they call "HILDA" -- 
> "human-in-the-loop data analytics".  The 3 projects listed have homes in 
> github and are linked to from Aditya's page: "Quick Project Links".  At the 
> talk, he said they have hosted versions running and they are looking for beta 
> testers.  There is a live demo of DataSpread at 
> http://kite.cs.illinois.edu:8080/.


Chris, thank you for bringing this to my attention.

Parameswaran, above, outlines 5 problems with “big data”:

  1. The Excel problem: Over-reliance on spreadsheets
  2. The exploration problem: Not knowing where to look
  3. The data lake problem: Messy cesspools of data
  4. The data versioning problem: Ad-hoc management of analysis 
  5. The learning problem: Hurdles in leveraging machine learning

I can identify with many of these problems, as I suspect many of you can too. So 
many times I see my fellow librarians trying to make sense of a data set with 
only Excel. Heck, they even try to evaluate MARC in this way. Some data just 
does not fit into a single matrix. “Messy” data is also a perennial problem. 
Again, coming back to our bibliographic data, the city of a 260 field might be 
South Bend, IN; South Bend, Ind.; or South Bend. Moreover, parsing the data 
from the records often brings along punctuation. Mr Kilgour’s name was Kilgour, 
Fredrick (1914-2006).
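
To make the "messy" point concrete, here is a minimal sketch in Python of the 
sort of cleanup I have in mind. It is only an illustration under assumptions: 
the function names and the tiny STATE_VARIANTS table are hypothetical, the 
input is assumed to be a place or name string already pulled out of a record, 
and a real collection would need a much fuller lookup table.

```python
# Hypothetical lookup of state spellings; a real project needs a fuller table.
STATE_VARIANTS = {"in": "IN", "ind": "IN", "indiana": "IN"}


def normalize_place(raw, default_state=None):
    """Reduce a place string (e.g. from a 260 field) to 'City, ST' if possible."""
    # Drop surrounding whitespace and the punctuation that tends to tag along.
    cleaned = raw.strip().strip(" :;,./[]")
    city, _, state = cleaned.partition(",")
    city = " ".join(city.split())
    key = state.strip().rstrip(".").lower()
    if not key:
        # No state in the record; fall back to a caller-supplied guess, if any.
        state = default_state
    else:
        # Map known variants; leave anything unrecognized as it was found.
        state = STATE_VARIANTS.get(key, state.strip())
    return f"{city}, {state}" if state else city


def strip_trailing_punctuation(raw):
    """Remove trailing punctuation from a heading.

    Caution: this also removes the period after a final initial.
    """
    return raw.strip().rstrip(" .,;:/")


for value in ("South Bend, IN", "South Bend, Ind. :", "South Bend"):
    print(normalize_place(value, default_state="IN"))
print(strip_trailing_punctuation("Kilgour, Fredrick (1914-2006)."))
```

Supplying default_state is a guess on the part of the person doing the 
cleanup, not something the record itself says, so it is best used only when 
the whole set is known to come from one place.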

Put another way, yes, I spend a lot of my time dealing with the issues outlined 
above, and I believe such work is a possibility for modern librarianship. 
Nowadays, finding things is not nearly as much of a problem to solve. Instead, 
I believe the more pressing problem is enabling people (readers) to use & 
understand the data/information they find.

—
Eric Morgan
