Re: [FRIAM] [ SPAM ] Re: Fwd: Share Your Knowledge: Taxonomy Boot Camp

Owen Densmore Fri, 20 Feb 2015 16:12:06 -0800

My first tech job was a Patent database for Xerox in the early '70s.  It
was a huge database and we had professional librarians manage the taxonomy
for it.


At the same time we had a pretty sound "natural language" search, much like
modern document-term matrices, along with SVD-based reduction to make the
dictionary tractable.  To search, you simply typed a phrase that defined the
patent as best you could. The search text was simply added to the document
space and its nearest neighbors were returned (simple cosine distance was the
default).

The document space was an n-dimensional cartesian space where the axes were
the dictionary entries. Each doc was given a vector in that space, where
each vector component was the number of times a given dictionary term
appeared in the document.
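
A minimal sketch of the search described above, assuming whitespace
tokenization and raw term counts. The toy corpus and the function names are
invented for illustration and are not taken from the original Xerox system.
The SVD step is left out, so this works in the full dictionary space.

```python
import math
from collections import Counter

def term_counts(text):
    """Map a text to its sparse term-count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_documents(query, docs, k=2):
    """Place the query in the document space and return the k closest docs."""
    q = term_counts(query)
    scored = [(cosine_similarity(q, term_counts(text)), name)
              for name, text in docs.items()]
    scored.sort(reverse=True)
    return [name for _score, name in scored[:k]]

# Hypothetical toy corpus, invented for illustration only.
DOCS = {
    "toner": "dry toner powder fused to paper by heated roller",
    "laser": "laser beam scans photoconductive drum to form image",
    "feeder": "paper feeder separates sheets with friction roller",
}

print(nearest_documents("roller that feeds sheets of paper", DOCS))
# prints ['feeder', 'toner']
```

The query is treated exactly like a document, which is why writing a short
paragraph "as if it were the patent" gives the ranking more terms to match on.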

Telling researchers to type a small paragraph as if they were writing the
patent they were looking for worked really well: they "got it" immediately
and improved significantly with use.

This vastly outperformed the taxonomy, although we kept the taxonomy
indexing for many years.  The issues with the taxonomy were pretty obvious:
- Language drift was huge, especially in the research community, and they
knew it and compensated brilliantly.
- Difficulty of hand building the taxonomy
- Taxonomy ambiguity and conflict.
- Required the librarians to help the researchers search.  They despised
this!
- And it simply failed: the librarians and researchers looked at the
documentation in entirely different ways.
- Obsolescence: the taxonomy simply suffered from age over a decade or so.
Natural language was easier for researchers to manage and they succeeded
more than with a taxonomy.
- The taxonomy ended up in a tree-like search, but that failed on questions
like: is computer science under science -> computer, or computer -> science?
That's not a great example but you see the point. Graph searching was tried
with the taxonomy with little success.
- Cost

Yahoo had the same problem in the early days of web search, attempting to
use a hand-built taxonomy.

It does turn out that the combination of natural language searching, along
with categorization, is the best approach: it lets "Industry mushrooms in
New Mexico" easily be separated into categories, even if one is false. And
it's great for drill-down in general.
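
One possible reading of that combination, as a rough sketch: rank by text
similarity first, then let the searcher drill down by category. The documents,
category tags, and the `search` function below are hypothetical and invented
for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(n * b[t] for t, n in a.items())
    norms = (math.sqrt(sum(n * n for n in a.values()))
             * math.sqrt(sum(n * n for n in b.values())))
    return dot / norms if norms else 0.0

def search(query, docs, categories=()):
    """Rank docs by similarity to the query, keeping only those tagged with
    every requested category (an empty filter keeps everything)."""
    q = Counter(query.lower().split())
    wanted = set(categories)
    hits = []
    for doc in docs:
        if not wanted.issubset(doc["categories"]):
            continue
        score = cosine(q, Counter(doc["text"].lower().split()))
        if score > 0:
            hits.append((score, doc["id"]))
    return [doc_id for _score, doc_id in sorted(hits, reverse=True)]

# Hypothetical documents and category tags, invented for illustration.
DOCS = [
    {"id": "farm",
     "text": "wild mushrooms grown by local industry in new mexico caves",
     "categories": {"agriculture", "new mexico"}},
    {"id": "boom",
     "text": "industry mushrooms in new mexico as factories open",
     "categories": {"economy", "new mexico"}},
    {"id": "chile",
     "text": "chile harvest in new mexico",
     "categories": {"agriculture", "new mexico"}},
]

query = "industry mushrooms in new mexico"
print(search(query, DOCS))                              # ['boom', 'farm', 'chile']
print(search(query, DOCS, categories={"economy"}))      # ['boom']
print(search(query, DOCS, categories={"agriculture"}))  # ['farm', 'chile']
```

The text match alone cannot tell whether "mushrooms" is a noun or a verb; the
category filter is what separates the two readings.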

   -- Owen

On Fri, Feb 20, 2015 at 4:02 PM, Steve Smith <[email protected]> wrote:

>
>  Interesting topic here, at least to me.  Has anyone ever attended
>>> this?
>>>
>> Have not.  Some folks, like catalogers & librarians, are good at this
>> sort of thing, but it seems very tedious and hard to scale.
>>
>>
>>  From my limited experience/observation, it is a sticky and subtle
> problem.
>
> SpindleViz:  Over 10 years ago, I worked with a team doing ontology
> modeling to help them Visualize ontologies.  We produced a prototype,
> dynamic 3D visualizer (SpindleViz) which gave some traction on actually
> understanding the structure of a given Ontology, but the project more
> importantly gave me an understanding of how ontologies are used and built in
> some communities.   In this case we worked with the Gene Ontology which at
> the time was perhaps the largest and most mature and represented a very
> broad collaborative effort.  The effort of building a shared ontology
> appeared to me to be the ultimate in compromise.
>
> NSF Scientific Collaboration:   Later I found myself working with Dr.
> Deana Pennington at UNM on a NSF project for developing formal tools for
> Scientific Collaboration called SciDesign.   This project included a study
> of the problem of normalizing terminologies across a diverse team of
> Scientists working on a common problem.  In this case climate change.
>  Contrary to some assumptions, the language across seemingly related
> disciplines such as say Atmospheric and Ocean Science or Biology and
> Ecology is not just unaligned, but perhaps insidiously counter-aligned, or
> maybe more to the point in some sense "dissonant".   Science, in its
> pursuit of both understanding and precision draws its language from
> existing disciplines for the "similarity" to the topic or idea at hand but
> then in the pursuit of precision, changes the meaning of the terms in often
> fundamental if subtle ways which are often not obvious to the discipline
> from which the terms are adopted.  More often two related disciplines
> derive terms from a root source and neither understands how the *other*
> uses them differently.
>
> In pursuit of a methodology to improve Scientific Collaboration in
> general, one of the fundamental problems was to come up with a fairly
> simple methodology to normalize these differences in lexicons.   Of course,
> underneath these lexicons were implicit ontologies, the complex
> relationships between the terms.  We discussed adapting a technique
> developed by Dr. Tim Goldsmith (also UNM) to help with this.   The basic
> concept was to interview each individual on a collaborative team, first for
> a set of "most common terms" used in their domain.   Once these terms were
> acquired for, say, 6 individuals with related but different domains, the
> pool of terms would be reduced to the subset of those which recurred in two
> or more individuals' lexicons.   Each individual would then be presented
> with a matrix of these terms registered against each other and they would be
> asked to provide a measure of correlation between each pair of terms.   The
> idea of course, was to build a very rough model of their model as it were,
> to get a handle on how closely aligned each practitioner's model of the
> implicit domain they were studying was.    The result was to be a set of
> weighted graphs of overlapping terms used in their domains when applied to
> the common problem.   While this is not a formal ontology, one might think
> of it as a proto-ontology of sorts, a place to begin to build an ontology
> from.
>
> The point of this was a methodology for "just in time" proto-ontology
> building.   Of course, the funding for this work ran out, Dr. Pennington
> moved to UTEP, and as far as I know things in this area have been on hold
> since then.
>
> Most recently, I worked with other UNM Researchers, Drs. Caudell,
> Gilfeather, Lugar, Taha, et al on a project ultimately entitled "Faceted
> Ontologies" which was primarily about building, from open source
> Intelligence, knowledge structures, developing a normalized model for them,
> and providing tools for extracting specific aggregate knowledge *from*
> those sources, and very specifically presented *as* a structure, not simply
> a list of factoids or simple linear report.   The tools from my former two
> projects were to be developed further to support the visualization, as it
> were, from multiple conceptual viewpoints (aka "facets" of the ontology).
> This was a *very* ambitious project and the basic underpinnings (building
> formal models of ontologies  on top of Category Theory) were done.
>
> I still believe that there is good work to be done in this area, but the
> level of sophistication required to develop the mechanisms underlying my
> own part is pretty daunting.    I occasionally scan the literature and SBIR
> solicitations for new developments and funding sources for this work...  It
> would be very welcome if anyone here happened to have some traction in this
> domain... I can provide a few references, unfortunately most of the results
> out of the second two projects were merely internal reports to the
> customers and very preliminary white-papers.
>
> The domain I find this work most interesting *for* perhaps is
> Journalism...   but the problem is exacerbated by there being much less
> formal languages developed (to my knowledge) across journalism... perhaps
> that is changing, or perhaps the demands of scientific journalism at least
> lead journalists as "outsiders" and "laymen" to the fields to not only do
> this same task intuitively but to have some of their own formal
> methodologies and tools?
>
> - Steve
>
>
> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at cafe at St. John's College
> to unsubscribe http://redfish.com/mailman/listinfo/friam_redfish.com
>

