On Sun, Oct 31, 2010 at 10:26 PM, Rob Sargent <robjsarg...@gmail.com> wrote:
> Viktor Bojović wrote:
>> On Sun, Oct 31, 2010 at 9:42 PM, Rob Sargent <robjsarg...@gmail.com> wrote:
>>> Viktor Bojović wrote:
>>>> On Sun, Oct 31, 2010 at 2:26 AM, James Cloos <cl...@jhcloos.com> wrote:
>>>>> >>>>> "VB" == Viktor Bojović <viktor.bojo...@gmail.com> writes:
>>>>>
>>>>> VB> i have very big XML document which is larger than 50GB and want to
>>>>> VB> import it into database, and transform it to relational schema.
>>>>>
>>>>> Were I doing such a conversion, I'd use perl to convert the xml into
>>>>> something which COPY can grok. Any other language, script or compiled,
>>>>> would work just as well. The goal is to avoid having to slurp the whole
>>>>> xml structure into memory.
>>>>>
>>>>> -JimC
>>>>> --
>>>>> James Cloos <cl...@jhcloos.com>
>>>>> OpenPGP: 1024D/ED7DAEA6
>>>>
>>>> The insertion into the database is not a very big problem.
>>>> I insert it as XML docs, or as varchar lines, or as XML docs in
>>>> varchar format. Usually I use a transaction and commit after each
>>>> block of 1000 inserts, and it goes very fast, so the insertion is
>>>> over after a few hours.
>>>> The problem occurs when I want to transform it inside the
>>>> database from XML (varchar or XML format) into tables by parsing.
>>>> That processing takes too much time in the database, no matter whether
>>>> it is stored as varchar lines, varchar nodes or the XML data type.
>>>>
>>>> --
>>>> ---------------------------------------
>>>> Viktor Bojović
>>>> ---------------------------------------
>>>> Wherever I go, Murphy goes with me
>>>
>>> Are you saying you first load the xml into the database, then
>>> parse that xml into instances of objects (rows in tables)?
>>
>> Yes. That way takes less RAM than using twig or simple xml, so I tried
>> using the PostgreSQL xml functions or regexes.
>>
>> --
>> ---------------------------------------
>> Viktor Bojović
>> ---------------------------------------
>> Wherever I go, Murphy goes with me
>
> Is the entire load a set of "entry" elements as your example contains?
> This I believe would parse nicely into a tidy but non-trivial schema
> directly without the "middle-man" of having xml in db (unless of course you
> prefer xpath to sql ;) )
>
> The single most significant caveat I would have for you is Beware:
> Biologists involved. Inconsistency (at least overloaded concepts) almost
> assured :). EMBL too is suspect imho, but I've been out of that arena for a
> while.

Unfortunately, some elements are always missing, so I had to write a script that scanned the whole Swiss-Prot and TrEMBL documents and stored the result in a file. I want to use that file as a template for building a code generator, once I find the best parser for this purpose.

To parse all the entries in one day, I need a parser that can handle at least 128 entry blocks per second at 2.4 GHz.

You are right about the inconsistency; I constantly have problems with PDB files.

By the way, you mentioned "This I believe would parse nicely into a tidy but non-trivial schema directly". Does that mean PostgreSQL has support for restoring a database schema from XML files?

--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

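
For illustration, the approach James Cloos suggests in the quoted text above (convert the XML outside the database into something COPY can load, without holding the whole document in memory) could be sketched as follows. This is a minimal sketch, not anyone's actual script: it uses Python rather than perl, and the function names, the three chosen columns, and the assumption that every "entry" element is a direct child of the document root are all hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{uri}entry' -> 'entry'."""
    return tag.rsplit("}", 1)[-1]


def copy_escape(value):
    """Escape one value for PostgreSQL's tab-delimited COPY text format."""
    if value is None:
        return r"\N"  # COPY's default NULL marker
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def entries_to_copy(source, out):
    """Stream <entry> elements and write one COPY row per entry.

    Columns (an assumption for this sketch): first accession, first name,
    and the 'length' attribute of the sequence element.
    Returns the number of rows written.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # keep a handle on the root so it can be emptied
    rows = 0
    for event, elem in context:
        if event != "end" or local(elem.tag) != "entry":
            continue
        accession = name = length = None
        for child in elem:
            tag = local(child.tag)
            if tag == "accession" and accession is None:
                accession = child.text
            elif tag == "name" and name is None:
                name = child.text
            elif tag == "sequence":
                length = child.get("length")
        row = (accession, name, length)
        out.write("\t".join(copy_escape(v) for v in row) + "\n")
        rows += 1
        root.clear()  # discard finished entries so memory use stays flat
    return rows
```

The function takes any binary file object and any text output stream, so the rows could be piped into psql and loaded with COPY ... FROM STDIN into a staging table of your choosing. Missing elements simply become NULLs, which matters here because some elements are not present in every entry.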
entry[]->sequence[] entry[]->feature[] entry[]->reference[] entry[]->feature[]->location[]->position[]->status entry[]->dbReference[]->property[] entry[]->reference[]->citation[]->last entry[]->comment[]->text[]->status entry[]->geneLocation[]->type entry[]->comment[]->experiments[] entry[]->comment[]->conflict[]->sequence[] entry[]->comment[]->subcellularLocation[]->orientation[]->status entry[]->protein[]->domain[]->alternativeName[]->fullName[] entry[]->evidence[]->category entry[]->feature[]->location[]->begin[]->status entry[]->reference[]->citation[]->volume entry[]->feature[]->evidence entry[]->dbReference[]->type entry[]->reference[]->citation[]->authorList[]->consortium[] entry[]->version entry[]->comment[]->location[]->sequence entry[]->sequence[]->version entry[]->proteinExistence[] entry[]->reference[]->scope[] entry[]->reference[]->source[]->plasmid[] entry[]->reference[]->citation[]->dbReference[] entry[]->comment[]->locationType entry[]->protein[]->domain[] entry[]->reference[]->citation[]->publisher entry[]->gene[]->name[] entry[]->protein[]->domain[]->alternativeName[]->ref entry[]->comment[]->conflict[] entry[]->evidence[] entry[]->sequence[]->modified entry[]->comment[]->conflict[]->sequence[]->id entry[]->keyword[]->id entry[]->comment[]->redoxPotential[]->evidence entry[]->comment[]->link[] entry[]->feature[]->location[]->position[] entry[]->reference[]->citation[]->dbReference[]->id entry[]->organismHost[]->dbReference[]->key entry[]->organism[]->lineage[] entry[]->organismHost[]->name[]->type entry[]->comment[]->location[]->position[]->evidence entry[]->proteinExistence[]->type entry[]->protein[]->component[]->alternativeName[]->shortName[] entry[]->reference[]->citation[]->editorList[]->person[] entry[]->comment[]->isoform[]->sequence[]->type entry[]->organismHost[]->dbReference[]->type entry[]->sequence[]->checksum entry[]->gene[] entry[]->gene[]->name[]->type entry[]->dbReference[]->id entry[]->comment[]->phDependence[]->evidence 
entry[]->comment[]->interactant[] entry[]->dbReference[]->property[]->type entry[]->protein[]->alternativeName[]->fullName[] entry[]->protein[]->component[]->recommendedName[]->shortName[] entry[]->protein[]->allergenName[] entry[]->comment[]->kinetics[]->KM[] entry[]->comment[]->absorption[] entry[]->comment[]->error entry[]->protein[]->component[]->recommendedName[] entry[]->comment[]->link[]->uri entry[]->comment[]->location[] entry[]->comment[]->location[]->position[] entry[]->feature[]->location[]->begin[] entry[]->comment[]->isoform[]->sequence[]->ref entry[]->protein[]->component[]->alternativeName[] entry[]->feature[]->location[]->end[]->position entry[]->reference[]->citation[]->type entry[]->reference[]->citation[]->authorList[]->consortium[]->name entry[]->comment[]->name entry[]->comment[]->location[]->end[] entry[]->sequence[]->precursor entry[]->sequence[]->mass entry[]->evidence[]->date entry[]->protein[]->component[]->alternativeName[]->ref entry[]->comment[]->subcellularLocation[]->location[] entry[]->feature[]->status entry[]->comment[]->method entry[]->feature[]->location[]->position[]->position entry[]->comment[]->interactant[]->intactId entry[]->feature[]->location[]->end[]->status entry[]->protein[]->alternativeName[]->shortName[] entry[]->comment[]->absorption[]->text[]->evidence entry[]->comment[]->kinetics[]->text[] entry[]->protein[]->domain[]->recommendedName[]->shortName[] entry[]->protein[] entry[]->organism[]->lineage[]->taxon[] entry[]->protein[]->domain[]->recommendedName[] entry[]->keyword[] entry[]->reference[]->citation[]->city entry[]->protein[]->component[]->recommendedName[]->ref entry[]->reference[]->source[]->tissue[] entry[]->comment[]->kinetics[] entry[]->comment[]->temperatureDependence[] entry[]->reference[]->citation[]->dbReference[]->type entry[]->comment[]->type entry[]->comment[]->conflict[]->sequence[]->resource entry[]->comment[]->isoform[]->note[] entry[]->comment[]->subcellularLocation[]->orientation[] 
entry[]->comment[]->absorption[]->max[]->evidence entry[]->comment[]->subcellularLocation[]->topology[] entry[]->protein[]->recommendedName[]->fullName[] entry[]->feature[]->id entry[]->comment[]->subcellularLocation[]->topology[]->status entry[]->reference[]->citation[]->first entry[]->dbReference[] entry[]->comment[]->isoform[]->name[] entry[]->accession[] entry[]->feature[]->location[] entry[]->protein[]->component[]->allergenName[] entry[]->comment[]->molecule[] entry[]->comment[]->text[] entry[]->protein[]->component[]->recommendedName[]->fullName[] entry[]->comment[]->phDependence[] entry[]->comment[]->interactant[]->label[] entry[]->comment[]->isoform[]->sequence[] entry[]->reference[]->citation[]->authorList[] entry[]->comment[]->isoform[] entry[]->sequence[]->fragment entry[]->feature[]->location[]->end[] entry[]->reference[]->citation[]->number entry[]->comment[]->location[]->end[]->status entry[]->organism[] entry[]->protein[]->component[]->alternativeName[]->fullName[] entry[]->organismHost[]->dbReference[]->id entry[]->protein[]->alternativeName[] entry[]->reference[]->citation[]->date entry[]->protein[]->domain[]->alternativeName[]->shortName[] entry[]->reference[]->citation[]->editorList[]->person[]->name entry[]->dbReference[]->property[]->value entry[]->protein[]->domain[]->alternativeName[] entry[]->reference[]->citation[]->title[] entry[]->comment[]->subcellularLocation[] entry[]->comment[]->mass entry[]->protein[]->component[] entry[]->reference[]->citation[]->dbReference[]->key entry[]->protein[]->recommendedName[]->shortName[] entry[]->organismHost[]->name[] entry[]->reference[]->source[]->strain[] entry[]->comment[]->conflict[]->type entry[]->organism[]->name[] entry[]->reference[]->citation[]->institute entry[]->comment[]->evidence entry[]->feature[]->original[] entry[]->protein[]->domain[]->recommendedName[]->fullName[] entry[]->name[] entry[]->comment[]->location[]->begin[]->position entry[]->organism[]->dbReference[] 
entry[]->reference[]->citation[] entry[]->organism[]->name[]->type entry[]->reference[]->source[] entry[]->organismHost[]->dbReference[] entry[]->reference[]->citation[]->country entry[] entry[]->dbReference[]->key entry[]->feature[]->ref entry[]->protein[]->alternativeName[]->ref entry[]->geneLocation[]->name[]->status entry[]->comment[]->event[] entry[]->comment[]->isoform[]->id[] entry[]->evidence[]->key entry[]->comment[]->kinetics[]->Vmax[] entry[]->dataset entry[]->protein[]->recommendedName[]->ref entry[]->reference[]->citation[]->authorList[]->person[] entry[]->comment[]->organismsDiffer[] entry[]->comment[]->conflict[]->sequence[]->version entry[]->reference[]->source[]->transposon[] entry[]->geneLocation[] entry[]->reference[]->citation[]->db entry[]->reference[]->citation[]->locator[] entry[]->organism[]->dbReference[]->key entry[]->comment[]->conflict[]->ref entry[]->modified entry[]->sequence[]->length entry[]->comment[]->location[]->begin[]->status entry[]->comment[]->temperatureDependence[]->evidence entry[]->reference[]->key entry[]->comment[]->kinetics[]->KM[]->evidence entry[]->created entry[]->comment[]->redoxPotential[] entry[]->comment[]->absorption[]->max[] entry[]->evidence[]->attribute entry[]->comment[]->location[]->end[]->position entry[]->comment[]->kinetics[]->text[]->evidence entry[]->comment[] entry[]->organism[]->dbReference[]->type entry[]->comment[]->subcellularLocation[]->location[]->status entry[]->comment[]->location[]->position[]->position entry[]->comment[]->event[]->type entry[]->reference[]->citation[]->authorList[]->person[]->name entry[]->feature[]->location[]->begin[]->position entry[]->reference[]->citation[]->name entry[]->reference[]->citation[]->editorList[] entry[]->feature[]->description entry[]->organism[]->dbReference[]->id entry[]->protein[]->domain[]->recommendedName[]->ref entry[]->comment[]->location[]->begin[] entry[]->protein[]->recommendedName[] entry[]->feature[]->type entry[]->organismHost[] 
entry[]->comment[]->absorption[]->text[] entry[]->feature[]->variation[] entry[]->geneLocation[]->name[] entry[]->evidence[]->type entry[]->comment[]->interactant[]->id[] entry[]->protein[]->cdAntigenName[]
entry[]->sequence[] entry[]->feature[] entry[]->reference[] entry[]->dbReference[]->property[] entry[]->reference[]->citation[]->last entry[]->comment[]->text[]->status entry[]->comment[]->experiments[] entry[]->geneLocation[]->type entry[]->feature[]->location[]->begin[]->status entry[]->evidence[]->category entry[]->reference[]->citation[]->volume entry[]->feature[]->evidence entry[]->dbReference[]->type entry[]->reference[]->citation[]->authorList[]->consortium[] entry[]->version entry[]->sequence[]->version entry[]->gene[]->name[]->evidence entry[]->proteinExistence[] entry[]->reference[]->scope[] entry[]->reference[]->source[]->plasmid[] entry[]->reference[]->citation[]->dbReference[] entry[]->reference[]->citation[]->publisher entry[]->gene[]->name[] entry[]->evidence[] entry[]->reference[]->evidence entry[]->sequence[]->modified entry[]->keyword[]->id entry[]->organismHost[]->dbReference[]->key entry[]->feature[]->location[]->position[] entry[]->reference[]->citation[]->dbReference[]->id entry[]->organismHost[]->name[]->type entry[]->organism[]->lineage[] entry[]->proteinExistence[]->type entry[]->reference[]->citation[]->editorList[]->person[] entry[]->organismHost[]->dbReference[]->type entry[]->sequence[]->checksum entry[]->gene[] entry[]->gene[]->name[]->type entry[]->dbReference[]->evidence entry[]->dbReference[]->id entry[]->reference[]->source[]->tissue[]->evidence entry[]->comment[]->interactant[] entry[]->protein[]->recommendedName[]->fullName[]->evidence entry[]->dbReference[]->property[]->type entry[]->feature[]->location[]->begin[] entry[]->protein[]->submittedName[]->ref entry[]->feature[]->location[]->end[]->position entry[]->reference[]->citation[]->type entry[]->reference[]->citation[]->authorList[]->consortium[]->name entry[]->sequence[]->precursor entry[]->sequence[]->mass entry[]->evidence[]->date entry[]->comment[]->subcellularLocation[]->location[] entry[]->feature[]->status entry[]->feature[]->location[]->position[]->position 
entry[]->comment[]->interactant[]->intactId entry[]->feature[]->location[]->end[]->status entry[]->protein[] entry[]->protein[]->submittedName[]->fullName[] entry[]->organism[]->lineage[]->taxon[] entry[]->keyword[] entry[]->reference[]->citation[]->city entry[]->reference[]->source[]->tissue[] entry[]->reference[]->citation[]->dbReference[]->type entry[]->comment[]->type entry[]->comment[]->subcellularLocation[]->topology[] entry[]->protein[]->recommendedName[]->fullName[] entry[]->feature[]->id entry[]->comment[]->subcellularLocation[]->topology[]->status entry[]->reference[]->citation[]->first entry[]->dbReference[] entry[]->accession[] entry[]->feature[]->location[] entry[]->comment[]->molecule[] entry[]->comment[]->text[] entry[]->comment[]->interactant[]->label[] entry[]->reference[]->citation[]->authorList[] entry[]->sequence[]->fragment entry[]->reference[]->source[]->plasmid[]->evidence entry[]->feature[]->location[]->end[] entry[]->reference[]->citation[]->number entry[]->organism[] entry[]->organismHost[]->dbReference[]->id entry[]->reference[]->citation[]->date entry[]->dbReference[]->property[]->value entry[]->reference[]->citation[]->editorList[]->person[]->name entry[]->reference[]->citation[]->title[] entry[]->comment[]->subcellularLocation[] entry[]->reference[]->citation[]->dbReference[]->key entry[]->organismHost[]->name[] entry[]->reference[]->source[]->strain[] entry[]->organism[]->name[] entry[]->comment[]->evidence entry[]->reference[]->citation[]->institute entry[]->name[] entry[]->organism[]->dbReference[] entry[]->reference[]->citation[] entry[]->organism[]->name[]->type entry[]->reference[]->source[] entry[]->organismHost[]->dbReference[] entry[] entry[]->dbReference[]->key entry[]->reference[]->citation[]->country entry[]->geneLocation[]->name[]->status entry[]->evidence[]->key entry[]->dataset entry[]->protein[]->recommendedName[]->ref entry[]->reference[]->citation[]->authorList[]->person[] entry[]->keyword[]->evidence 
entry[]->comment[]->organismsDiffer[] entry[]->reference[]->source[]->transposon[] entry[]->reference[]->citation[]->db entry[]->geneLocation[] entry[]->organism[]->dbReference[]->key entry[]->modified entry[]->sequence[]->length entry[]->reference[]->key entry[]->created entry[]->geneLocation[]->evidence entry[]->evidence[]->attribute entry[]->comment[] entry[]->organism[]->dbReference[]->type entry[]->comment[]->subcellularLocation[]->location[]->status entry[]->protein[]->submittedName[] entry[]->reference[]->citation[]->authorList[]->person[]->name entry[]->reference[]->citation[]->name entry[]->feature[]->location[]->begin[]->position entry[]->reference[]->source[]->strain[]->evidence entry[]->feature[]->description entry[]->reference[]->citation[]->editorList[] entry[]->organism[]->dbReference[]->id entry[]->protein[]->submittedName[]->fullName[]->evidence entry[]->protein[]->recommendedName[] entry[]->feature[]->type entry[]->organismHost[] entry[]->organism[]->evidence entry[]->evidence[]->type entry[]->geneLocation[]->name[] entry[]->comment[]->interactant[]->id[]
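
For readers who want to reproduce lists like the two above, a streaming scan along the following lines would collect every distinct path without loading the document into memory. This is only a sketch under assumptions, not the script that produced the lists: the function names are hypothetical, and it assumes that names ending in "[]" denote elements while bare names denote attributes, which is how the notation above appears to work.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{uri}entry' -> 'entry'."""
    return tag.rsplit("}", 1)[-1]


def collect_paths(source):
    """Return the set of distinct element and attribute paths in a document.

    Elements are written as 'name[]' and attributes as bare names, joined
    with '->'. The document root itself is left out of the paths.
    """
    paths = set()
    stack = []  # currently open elements below the root, as 'name[]'
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # consume the root's own start event
    for event, elem in context:
        if event == "start":
            stack.append(local(elem.tag) + "[]")
            here = "->".join(stack)
            paths.add(here)
            # Attributes are already available when the start event fires.
            for attr in elem.attrib:
                paths.add(here + "->" + local(attr))
        else:
            if stack:
                stack.pop()
            if not stack:
                root.clear()  # a top-level child finished; free its subtree
    return paths
```

Writing `sorted(collect_paths(f))` to a file, one path per line, gives a template that a code generator could read. Comparing the sets from two documents with ordinary set difference would also show which paths occur in only one of them.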
--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql