On Sun, Oct 31, 2010 at 10:26 PM, Rob Sargent <robjsarg...@gmail.com> wrote:



> Viktor Bojović wrote:
>
>>
>>
>> On Sun, Oct 31, 2010 at 9:42 PM, Rob Sargent <robjsarg...@gmail.com> wrote:
>>
>>
>>
>>
>>    Viktor Bojović wrote:
>>
>>
>>
>>        On Sun, Oct 31, 2010 at 2:26 AM, James Cloos
>>        <cl...@jhcloos.com> wrote:
>>
>>           >>>>> "VB" == Viktor Bojović <viktor.bojo...@gmail.com> writes:
>>
>>           VB> i have very big XML document which is larger than 50GB and
>>           VB> want to import it into database, and transform it to
>>           VB> relational schema.
>>
>>           Were I doing such a conversion, I'd use perl to convert the
>>           xml into something which COPY can grok. Any other language,
>>           script or compiled, would work just as well. The goal is to
>>           avoid having to slurp the whole xml structure into memory.
>>
>>           -JimC
>>           --
>>           James Cloos <cl...@jhcloos.com>
>>
>>
>>           OpenPGP: 1024D/ED7DAEA6
>>
>>
>>        Inserting into the database is not a big problem.
>>        I insert it as XML docs, or as varchar lines, or as XML docs in
>>        varchar format. Usually I use a transaction and commit after each
>>        block of 1000 inserts, and it goes very fast, so the insertion is
>>        over after a few hours.
>>        The problem occurs when I want to transform it inside the
>>        database from XML (varchar or XML format) into tables by parsing.
>>        That processing takes too much time in the database, no matter
>>        whether it is stored as varchar lines, varchar nodes or the XML
>>        data type.
>>
>>        --
>>        ---------------------------------------
>>        Viktor Bojović
>>
>>        ---------------------------------------
>>        Wherever I go, Murphy goes with me
>>
>>
>>    Are you saying you first load the xml into the database, then
>>    parse that xml into instance of objects (rows in tables)?
>>
>>
>> Yes. That way takes less RAM than using twig or simple xml, so I tried
>> using PostgreSQL XML functions or regexes.
>>
>>
>>
>> --
>> ---------------------------------------
>> Viktor Bojović
>> ---------------------------------------
>> Wherever I go, Murphy goes with me
>>
> Is the entire load a set of "entry" elements as your example contains?
>  This I believe would parse nicely into a tidy but non-trivial schema
> directly without the "middle-man" of having xml in db (unless of course you
> prefer xpath to sql ;) )
>
> The single most significant caveat I would have for you is Beware:
> Biologists involved. Inconsistency (at least overloaded concepts)  almost
> assured :).  EMBL too is suspect imho, but I've been out of that arena for a
> while.
>
>
Unfortunately some elements are always missing, so I had to write a script
which scanned the whole Swiss-Prot and TrEMBL documents and stored the result
in a file, to use as a template for building a code generator once I find the
best parser for this purpose. To parse all entries in one day I need a parser
capable of parsing at least 128 entry blocks per second @ 2.4GHz. You are
right about inconsistency; I constantly have problems with PDB files.
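
A minimal sketch of what such a scanning pass could look like, written in
Python for illustration (it is not the script actually used). It streams the
file with the standard library's iterparse, so memory stays bounded by one
entry at a time. The names collect_paths, local_name and record_tag are
hypothetical, and the path notation ("[]" after element names, bare names for
attributes) is an assumption read off the path lists attached to this message.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix from a tag or attribute name, if present."""
    return tag.rsplit("}", 1)[-1]


def collect_paths(source, record_tag="entry"):
    """Stream through `source` and return the set of element and attribute
    paths seen inside `record_tag` elements, without keeping the whole tree."""
    paths = set()
    stack = []   # path components of the elements currently open in a record
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem              # first event is the start of the root
        name = local_name(elem.tag)
        if event == "start":
            if stack or name == record_tag:
                stack.append(name + "[]")
                path = "->".join(stack)
                paths.add(path)
                # Attributes are already available on "start" events.
                for attr in elem.attrib:
                    paths.add(path + "->" + local_name(attr))
        elif stack:
            stack.pop()
            if not stack:
                root.clear()         # record finished: free its subtree
    return paths


# Tiny in-memory stand-in for the real file (hypothetical content).
sample = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot" version="1">
    <accession>P00001</accession>
    <feature type="chain"><location><begin position="1"/></location></feature>
  </entry>
</uniprot>"""

for p in sorted(collect_paths(io.BytesIO(sample))):
    print(p)
```

For the real data, a file name or an open file object can be passed in place
of the BytesIO wrapper. Whether this reaches 128 entries per second would have
to be measured.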

btw.
you have mentioned "This I believe would parse nicely into a tidy but
non-trivial schema directly"; does that mean that PostgreSQL has support for
restoring the database schema from XML files?
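
One possible reading of "directly" is the route James suggested earlier in the
thread: parse each entry outside the database and emit rows that COPY can
load. A minimal Python sketch under that assumption follows. The function
names and the three chosen columns (first accession, dataset, sequence length)
are hypothetical and only illustrate the idea; the "{*}" namespace wildcard
needs Python 3.8 or later.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix from a tag name, if present."""
    return tag.rsplit("}", 1)[-1]


def copy_escape(value):
    """Escape one value for PostgreSQL's COPY text format."""
    if value is None:
        return "\\N"                     # NULL marker in COPY text format
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def entries_to_copy(source, out):
    """Write one tab-separated line per <entry> to `out`; return the count."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)              # start event of the root element
    count = 0
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "entry":
            seq = elem.find("{*}sequence")
            row = (
                elem.findtext("{*}accession"),   # first accession only
                elem.get("dataset"),
                seq.get("length") if seq is not None else None,
            )
            out.write("\t".join(copy_escape(v) for v in row) + "\n")
            count += 1
            root.clear()                 # drop the finished entry from memory
    return count


# Tiny in-memory stand-in for the real file (hypothetical content).
sample = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>P00001</accession>
    <sequence length="3">MKV</sequence>
  </entry>
</uniprot>"""

buf = io.StringIO()
entries_to_copy(io.BytesIO(sample), buf)
print(buf.getvalue(), end="")
```

The output could then be fed to COPY ... FROM STDIN on a staging table with
matching columns. Child elements such as feature or reference would need
their own output files and tables, keyed by accession.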

-- 
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me
entry[]->sequence[]
entry[]->feature[]
entry[]->reference[]
entry[]->feature[]->location[]->position[]->status
entry[]->dbReference[]->property[]
entry[]->reference[]->citation[]->last
entry[]->comment[]->text[]->status
entry[]->geneLocation[]->type
entry[]->comment[]->experiments[]
entry[]->comment[]->conflict[]->sequence[]
entry[]->comment[]->subcellularLocation[]->orientation[]->status
entry[]->protein[]->domain[]->alternativeName[]->fullName[]
entry[]->evidence[]->category
entry[]->feature[]->location[]->begin[]->status
entry[]->reference[]->citation[]->volume
entry[]->feature[]->evidence
entry[]->dbReference[]->type
entry[]->reference[]->citation[]->authorList[]->consortium[]
entry[]->version
entry[]->comment[]->location[]->sequence
entry[]->sequence[]->version
entry[]->proteinExistence[]
entry[]->reference[]->scope[]
entry[]->reference[]->source[]->plasmid[]
entry[]->reference[]->citation[]->dbReference[]
entry[]->comment[]->locationType
entry[]->protein[]->domain[]
entry[]->reference[]->citation[]->publisher
entry[]->gene[]->name[]
entry[]->protein[]->domain[]->alternativeName[]->ref
entry[]->comment[]->conflict[]
entry[]->evidence[]
entry[]->sequence[]->modified
entry[]->comment[]->conflict[]->sequence[]->id
entry[]->keyword[]->id
entry[]->comment[]->redoxPotential[]->evidence
entry[]->comment[]->link[]
entry[]->feature[]->location[]->position[]
entry[]->reference[]->citation[]->dbReference[]->id
entry[]->organismHost[]->dbReference[]->key
entry[]->organism[]->lineage[]
entry[]->organismHost[]->name[]->type
entry[]->comment[]->location[]->position[]->evidence
entry[]->proteinExistence[]->type
entry[]->protein[]->component[]->alternativeName[]->shortName[]
entry[]->reference[]->citation[]->editorList[]->person[]
entry[]->comment[]->isoform[]->sequence[]->type
entry[]->organismHost[]->dbReference[]->type
entry[]->sequence[]->checksum
entry[]->gene[]
entry[]->gene[]->name[]->type
entry[]->dbReference[]->id
entry[]->comment[]->phDependence[]->evidence
entry[]->comment[]->interactant[]
entry[]->dbReference[]->property[]->type
entry[]->protein[]->alternativeName[]->fullName[]
entry[]->protein[]->component[]->recommendedName[]->shortName[]
entry[]->protein[]->allergenName[]
entry[]->comment[]->kinetics[]->KM[]
entry[]->comment[]->absorption[]
entry[]->comment[]->error
entry[]->protein[]->component[]->recommendedName[]
entry[]->comment[]->link[]->uri
entry[]->comment[]->location[]
entry[]->comment[]->location[]->position[]
entry[]->feature[]->location[]->begin[]
entry[]->comment[]->isoform[]->sequence[]->ref
entry[]->protein[]->component[]->alternativeName[]
entry[]->feature[]->location[]->end[]->position
entry[]->reference[]->citation[]->type
entry[]->reference[]->citation[]->authorList[]->consortium[]->name
entry[]->comment[]->name
entry[]->comment[]->location[]->end[]
entry[]->sequence[]->precursor
entry[]->sequence[]->mass
entry[]->evidence[]->date
entry[]->protein[]->component[]->alternativeName[]->ref
entry[]->comment[]->subcellularLocation[]->location[]
entry[]->feature[]->status
entry[]->comment[]->method
entry[]->feature[]->location[]->position[]->position
entry[]->comment[]->interactant[]->intactId
entry[]->feature[]->location[]->end[]->status
entry[]->protein[]->alternativeName[]->shortName[]
entry[]->comment[]->absorption[]->text[]->evidence
entry[]->comment[]->kinetics[]->text[]
entry[]->protein[]->domain[]->recommendedName[]->shortName[]
entry[]->protein[]
entry[]->organism[]->lineage[]->taxon[]
entry[]->protein[]->domain[]->recommendedName[]
entry[]->keyword[]
entry[]->reference[]->citation[]->city
entry[]->protein[]->component[]->recommendedName[]->ref
entry[]->reference[]->source[]->tissue[]
entry[]->comment[]->kinetics[]
entry[]->comment[]->temperatureDependence[]
entry[]->reference[]->citation[]->dbReference[]->type
entry[]->comment[]->type
entry[]->comment[]->conflict[]->sequence[]->resource
entry[]->comment[]->isoform[]->note[]
entry[]->comment[]->subcellularLocation[]->orientation[]
entry[]->comment[]->absorption[]->max[]->evidence
entry[]->comment[]->subcellularLocation[]->topology[]
entry[]->protein[]->recommendedName[]->fullName[]
entry[]->feature[]->id
entry[]->comment[]->subcellularLocation[]->topology[]->status
entry[]->reference[]->citation[]->first
entry[]->dbReference[]
entry[]->comment[]->isoform[]->name[]
entry[]->accession[]
entry[]->feature[]->location[]
entry[]->protein[]->component[]->allergenName[]
entry[]->comment[]->molecule[]
entry[]->comment[]->text[]
entry[]->protein[]->component[]->recommendedName[]->fullName[]
entry[]->comment[]->phDependence[]
entry[]->comment[]->interactant[]->label[]
entry[]->comment[]->isoform[]->sequence[]
entry[]->reference[]->citation[]->authorList[]
entry[]->comment[]->isoform[]
entry[]->sequence[]->fragment
entry[]->feature[]->location[]->end[]
entry[]->reference[]->citation[]->number
entry[]->comment[]->location[]->end[]->status
entry[]->organism[]
entry[]->protein[]->component[]->alternativeName[]->fullName[]
entry[]->organismHost[]->dbReference[]->id
entry[]->protein[]->alternativeName[]
entry[]->reference[]->citation[]->date
entry[]->protein[]->domain[]->alternativeName[]->shortName[]
entry[]->reference[]->citation[]->editorList[]->person[]->name
entry[]->dbReference[]->property[]->value
entry[]->protein[]->domain[]->alternativeName[]
entry[]->reference[]->citation[]->title[]
entry[]->comment[]->subcellularLocation[]
entry[]->comment[]->mass
entry[]->protein[]->component[]
entry[]->reference[]->citation[]->dbReference[]->key
entry[]->protein[]->recommendedName[]->shortName[]
entry[]->organismHost[]->name[]
entry[]->reference[]->source[]->strain[]
entry[]->comment[]->conflict[]->type
entry[]->organism[]->name[]
entry[]->reference[]->citation[]->institute
entry[]->comment[]->evidence
entry[]->feature[]->original[]
entry[]->protein[]->domain[]->recommendedName[]->fullName[]
entry[]->name[]
entry[]->comment[]->location[]->begin[]->position
entry[]->organism[]->dbReference[]
entry[]->reference[]->citation[]
entry[]->organism[]->name[]->type
entry[]->reference[]->source[]
entry[]->organismHost[]->dbReference[]
entry[]->reference[]->citation[]->country
entry[]
entry[]->dbReference[]->key
entry[]->feature[]->ref
entry[]->protein[]->alternativeName[]->ref
entry[]->geneLocation[]->name[]->status
entry[]->comment[]->event[]
entry[]->comment[]->isoform[]->id[]
entry[]->evidence[]->key
entry[]->comment[]->kinetics[]->Vmax[]
entry[]->dataset
entry[]->protein[]->recommendedName[]->ref
entry[]->reference[]->citation[]->authorList[]->person[]
entry[]->comment[]->organismsDiffer[]
entry[]->comment[]->conflict[]->sequence[]->version
entry[]->reference[]->source[]->transposon[]
entry[]->geneLocation[]
entry[]->reference[]->citation[]->db
entry[]->reference[]->citation[]->locator[]
entry[]->organism[]->dbReference[]->key
entry[]->comment[]->conflict[]->ref
entry[]->modified
entry[]->sequence[]->length
entry[]->comment[]->location[]->begin[]->status
entry[]->comment[]->temperatureDependence[]->evidence
entry[]->reference[]->key
entry[]->comment[]->kinetics[]->KM[]->evidence
entry[]->created
entry[]->comment[]->redoxPotential[]
entry[]->comment[]->absorption[]->max[]
entry[]->evidence[]->attribute
entry[]->comment[]->location[]->end[]->position
entry[]->comment[]->kinetics[]->text[]->evidence
entry[]->comment[]
entry[]->organism[]->dbReference[]->type
entry[]->comment[]->subcellularLocation[]->location[]->status
entry[]->comment[]->location[]->position[]->position
entry[]->comment[]->event[]->type
entry[]->reference[]->citation[]->authorList[]->person[]->name
entry[]->feature[]->location[]->begin[]->position
entry[]->reference[]->citation[]->name
entry[]->reference[]->citation[]->editorList[]
entry[]->feature[]->description
entry[]->organism[]->dbReference[]->id
entry[]->protein[]->domain[]->recommendedName[]->ref
entry[]->comment[]->location[]->begin[]
entry[]->protein[]->recommendedName[]
entry[]->feature[]->type
entry[]->organismHost[]
entry[]->comment[]->absorption[]->text[]
entry[]->feature[]->variation[]
entry[]->geneLocation[]->name[]
entry[]->evidence[]->type
entry[]->comment[]->interactant[]->id[]
entry[]->protein[]->cdAntigenName[]
entry[]->sequence[]
entry[]->feature[]
entry[]->reference[]
entry[]->dbReference[]->property[]
entry[]->reference[]->citation[]->last
entry[]->comment[]->text[]->status
entry[]->comment[]->experiments[]
entry[]->geneLocation[]->type
entry[]->feature[]->location[]->begin[]->status
entry[]->evidence[]->category
entry[]->reference[]->citation[]->volume
entry[]->feature[]->evidence
entry[]->dbReference[]->type
entry[]->reference[]->citation[]->authorList[]->consortium[]
entry[]->version
entry[]->sequence[]->version
entry[]->gene[]->name[]->evidence
entry[]->proteinExistence[]
entry[]->reference[]->scope[]
entry[]->reference[]->source[]->plasmid[]
entry[]->reference[]->citation[]->dbReference[]
entry[]->reference[]->citation[]->publisher
entry[]->gene[]->name[]
entry[]->evidence[]
entry[]->reference[]->evidence
entry[]->sequence[]->modified
entry[]->keyword[]->id
entry[]->organismHost[]->dbReference[]->key
entry[]->feature[]->location[]->position[]
entry[]->reference[]->citation[]->dbReference[]->id
entry[]->organismHost[]->name[]->type
entry[]->organism[]->lineage[]
entry[]->proteinExistence[]->type
entry[]->reference[]->citation[]->editorList[]->person[]
entry[]->organismHost[]->dbReference[]->type
entry[]->sequence[]->checksum
entry[]->gene[]
entry[]->gene[]->name[]->type
entry[]->dbReference[]->evidence
entry[]->dbReference[]->id
entry[]->reference[]->source[]->tissue[]->evidence
entry[]->comment[]->interactant[]
entry[]->protein[]->recommendedName[]->fullName[]->evidence
entry[]->dbReference[]->property[]->type
entry[]->feature[]->location[]->begin[]
entry[]->protein[]->submittedName[]->ref
entry[]->feature[]->location[]->end[]->position
entry[]->reference[]->citation[]->type
entry[]->reference[]->citation[]->authorList[]->consortium[]->name
entry[]->sequence[]->precursor
entry[]->sequence[]->mass
entry[]->evidence[]->date
entry[]->comment[]->subcellularLocation[]->location[]
entry[]->feature[]->status
entry[]->feature[]->location[]->position[]->position
entry[]->comment[]->interactant[]->intactId
entry[]->feature[]->location[]->end[]->status
entry[]->protein[]
entry[]->protein[]->submittedName[]->fullName[]
entry[]->organism[]->lineage[]->taxon[]
entry[]->keyword[]
entry[]->reference[]->citation[]->city
entry[]->reference[]->source[]->tissue[]
entry[]->reference[]->citation[]->dbReference[]->type
entry[]->comment[]->type
entry[]->comment[]->subcellularLocation[]->topology[]
entry[]->protein[]->recommendedName[]->fullName[]
entry[]->feature[]->id
entry[]->comment[]->subcellularLocation[]->topology[]->status
entry[]->reference[]->citation[]->first
entry[]->dbReference[]
entry[]->accession[]
entry[]->feature[]->location[]
entry[]->comment[]->molecule[]
entry[]->comment[]->text[]
entry[]->comment[]->interactant[]->label[]
entry[]->reference[]->citation[]->authorList[]
entry[]->sequence[]->fragment
entry[]->reference[]->source[]->plasmid[]->evidence
entry[]->feature[]->location[]->end[]
entry[]->reference[]->citation[]->number
entry[]->organism[]
entry[]->organismHost[]->dbReference[]->id
entry[]->reference[]->citation[]->date
entry[]->dbReference[]->property[]->value
entry[]->reference[]->citation[]->editorList[]->person[]->name
entry[]->reference[]->citation[]->title[]
entry[]->comment[]->subcellularLocation[]
entry[]->reference[]->citation[]->dbReference[]->key
entry[]->organismHost[]->name[]
entry[]->reference[]->source[]->strain[]
entry[]->organism[]->name[]
entry[]->comment[]->evidence
entry[]->reference[]->citation[]->institute
entry[]->name[]
entry[]->organism[]->dbReference[]
entry[]->reference[]->citation[]
entry[]->organism[]->name[]->type
entry[]->reference[]->source[]
entry[]->organismHost[]->dbReference[]
entry[]
entry[]->dbReference[]->key
entry[]->reference[]->citation[]->country
entry[]->geneLocation[]->name[]->status
entry[]->evidence[]->key
entry[]->dataset
entry[]->protein[]->recommendedName[]->ref
entry[]->reference[]->citation[]->authorList[]->person[]
entry[]->keyword[]->evidence
entry[]->comment[]->organismsDiffer[]
entry[]->reference[]->source[]->transposon[]
entry[]->reference[]->citation[]->db
entry[]->geneLocation[]
entry[]->organism[]->dbReference[]->key
entry[]->modified
entry[]->sequence[]->length
entry[]->reference[]->key
entry[]->created
entry[]->geneLocation[]->evidence
entry[]->evidence[]->attribute
entry[]->comment[]
entry[]->organism[]->dbReference[]->type
entry[]->comment[]->subcellularLocation[]->location[]->status
entry[]->protein[]->submittedName[]
entry[]->reference[]->citation[]->authorList[]->person[]->name
entry[]->reference[]->citation[]->name
entry[]->feature[]->location[]->begin[]->position
entry[]->reference[]->source[]->strain[]->evidence
entry[]->feature[]->description
entry[]->reference[]->citation[]->editorList[]
entry[]->organism[]->dbReference[]->id
entry[]->protein[]->submittedName[]->fullName[]->evidence
entry[]->protein[]->recommendedName[]
entry[]->feature[]->type
entry[]->organismHost[]
entry[]->organism[]->evidence
entry[]->evidence[]->type
entry[]->geneLocation[]->name[]
entry[]->comment[]->interactant[]->id[]
-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
