<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
>
> 1415785_a_at /gb:NM_009840.1 /GEN=Cct8

Here's a pyparsing solution that will address your immediate question, and
also gives you some leeway for adding other "/" options to your search.
Pyparsing's home page is at pyparsing.wikispaces.com.

-- Paul


data = """
AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
"""

from pyparsing import *

# create expression we are looking for:
#   name [ junk word... ] /qualifier...
name = Word(alphanums,printables).setResultsName("name")
junkWord = ~(Literal("/")) + Word(printables)
qualifier = ("/" + Word(alphas+"_-.").setResultsName("key") + \
             oneOf("= :") + \
             Word(printables).setResultsName("value"))
expr = name + ZeroOrMore(junkWord) + \
       Dict(ZeroOrMore(qualifier)).setResultsName("quals")

# use parse action to repackage qualifier data to support "dict"-like
# access to qualifiers
qualifier.setParseAction( lambda t: (t.key,"".join(t)) )

# use this parse action instead if you just want whatever is
# after the '=' or ':' delimiter in the qualifier
# qualifier.setParseAction( lambda t: (t.key,t.value) )

# parse data strings, showing returned data structure
# (just to show what pyparsing results structure looks like)
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print res.dump()
    print
print

# now just do what the OP wanted in the first place
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print res.name, res.quals["gb"], res.quals["GEN"]


Gives these results:

['AFFX-BioB-5_at', 'E.', 'coli', [('GEN', '/GEN=bioB'), ('gb', '/gb:J04423.1')]]
- name: AFFX-BioB-5_at
- quals: [('GEN', '/GEN=bioB'), ('gb', '/gb:J04423.1')]
  - GEN: /GEN=bioB
  - gb: /gb:J04423.1

['1415785_a_at', [('gb', '/gb:NM_009840.1'), ('DB_XREF', '/DB_XREF=gi:6753327'), ('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'), ('CNT', '/CNT=482'), ('TID', '/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'), ('STK', '/STK=281'), ('UG', '/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF', '/DEF=Mus')]]
- name: 1415785_a_at
- quals: [('gb', '/gb:NM_009840.1'), ('DB_XREF', '/DB_XREF=gi:6753327'), ('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'), ('CNT', '/CNT=482'), ('TID', '/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'), ('STK', '/STK=281'), ('UG', '/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF', '/DEF=Mus')]
  - CNT: /CNT=482
  - DB_XREF: /DB_XREF=gi:6753327
  - DEF: /DEF=Mus
  - FEA: /FEA=FLmRNA
  - GEN: /GEN=Cct8
  - LL: /LL=12469
  - STK: /STK=281
  - TID: /TID=Mm.17989.1
  - TIER: /TIER=FL+Stack
  - UG: /UG=Mm.17989
  - gb: /gb:NM_009840.1


AFFX-BioB-5_at /gb:J04423.1 /GEN=bioB
1415785_a_at /gb:NM_009840.1 /GEN=Cct8

--
http://mail.python.org/mailman/listinfo/python-list
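
For comparison, here is a minimal sketch of the same extraction using only the standard library's re module instead of pyparsing. It is not the approach from the post. The names QUALIFIER and summarize are made up for illustration, and it assumes records are separated by a blank line and that the record name is the first whitespace-separated token. Unlike the pyparsing grammar, it looks at every token in a record rather than stopping at the first non-qualifier word, and it keeps the first occurrence of each wanted key in the order it appears.

```python
import re

# Same two sample records as in the post, separated by a blank line.
data = """
AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
"""

# Hypothetical pattern: a whole token of the form /KEY=value or /KEY:value.
QUALIFIER = re.compile(r"/([A-Za-z_.\-]+)[=:](\S+)")


def summarize(record, wanted=("GEN", "gb")):
    """Return the record name followed by the first /KEY token for each wanted key."""
    tokens = record.split()
    kept = [tokens[0]]          # assumption: the first token is the record name
    seen = set()
    for token in tokens[1:]:
        match = QUALIFIER.fullmatch(token)
        if match and match.group(1) in wanted and match.group(1) not in seen:
            seen.add(match.group(1))
            kept.append(token)
    return " ".join(kept)


# Skip the empty pieces that splitting on blank lines can leave behind.
results = [summarize(r) for r in data.split("\n\n") if r.strip()]
for line in results:
    print(line)
```

Because the qualifiers are emitted in the order they occur, the first record comes out as name, /GEN, /gb and the second as name, /gb, /GEN, which is the layout asked for in the original question. Add more keys to wanted to keep other "/" options.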