A simple solution is to use a text analysis package such as quanteda:

library(quanteda)
library(dplyr)  # for %>%

# `patterns` here is the character vector of drug names
drug_dictionary <- as.dictionary(data.frame(word = toupper(patterns),
                                            sentiment = patterns))

corpus(df$name) %>%
  tokens() %>%
  tokens_compound(drug_dictionary) %>%
  dfm() %>%
  dfm_lookup(drug_dictionary) %>%
  quanteda::convert(to = "data.frame")

On Tue, Apr 6, 2021 at 4:42 PM Felipe Barletta <felipe.e.barle...@gmail.com> wrote:
>
> Hi Gianpaolo,
>
> It works now, thank you!
>
> But it is not exactly what I need.
> I will explain better.
>
> Your solution is good for identifying what is an antibiotic, and my
> solution did this too:
>
> ######################################################
> matches <- unlist(sapply(patterns, function(p) grep(p, df$name,
>                                                     value = FALSE,
>                                                     ignore.case = TRUE)))
> anti <- df[matches,]
> ########################################################
>
> But what I need, beyond identifying what is an antibiotic:
> - Create a new variable (when the medicine is an antibiotic, i.e. it is in
>   the patterns object) holding the matching name from patterns.
> I did this with the code below, using fuzzyjoin::regex_left_join():
>
> #########################################################
> # List of medicines - object called patterns.
> patterns <- c("Oritavancina", "Oxacilina", "Pefloxacino", "Penicilina",
>               "Pexiganan", "Piperacilina-tazobactam", "Tazobactam",
>               "Pirazinamida", "Plazomicina", "Polimixina B",
>               "Posilozid", "Piperacilina")
> patterns <- toupper(patterns)
>
> # Sample data frame where I need to find the names from the list above.
> df <- data.frame(name =
>   c("CLORETO DE POTASSIO DRAGEA 600MG",
>     "CLORETO DE SODIO 0,9% SERINGA PREENCHIDA 5ML",
>     "CLORETO DE SODIO SOLUCAO INJETAVEL 0,9% 10ML",
>     "CODEINA FOSFATO SOLUCAO ORAL 3MGML 10ML ISCMPA @",
>     "CODEINA FOSFATO SOLUCAO ORAL 3MGML 5ML ISCMPA @",
>     "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
>     "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
>     "FUROSEMIDA SOLUCAO INJETAVEL 10MGML 2ML",
>     "HIDROCORTISONA SUCCINATO SODICO PO LIOFILO INJETAVEL 100MG",
>     "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
>     "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
>     "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
>     "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
>     "PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"))
>
> df <- df %>% mutate(name = toupper(name))
> patterns <- data.frame(name = patterns)
> results <- fuzzyjoin::regex_left_join(df,
>                                       patterns,
>                                       by = "name")
> results
> #########################################################
>
> Notice in the results object that when the medicine name is a compound
> ("PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"),
> the solution doesn't find "PIPERACILINA-TAZOBACTAM".
> The code created two rows instead, one with PIPERACILINA and the other
> with TAZOBACTAM.
>
> I think this explanation is clearer.
>
> On Tue, Apr 6, 2021 at 03:55, Gianpaolo Romeo <gianpaolo.ro...@gmail.com> wrote:
>
> > Sorry,
> > I wrote the code on a smartphone without using R. Try this:
> >
> > require(dplyr)
> >
> > patterns <- c("Oritavancina", "Oxacilina", "Pefloxacino", "Penicilina",
> >               "Pexiganan", "Piperacilina", "Piperacilina-tazobactam",
> >               "Pirazinamida", "Plazomicina", "Polimixina B",
> >               "Posilozid")
> >
> > patterns.new <- paste(patterns, collapse = "|")
> >
> > df <- data.frame(name =
> >   c("CLORETO DE POTASSIO DRAGEA 600MG",
> >     "CLORETO DE SODIO 0,9% SERINGA PREENCHIDA 5ML",
> >     "CLORETO DE SODIO SOLUCAO INJETAVEL 0,9% 10ML",
> >     "CODEINA FOSFATO SOLUCAO ORAL 3MGML 10ML ISCMPA @",
> >     "CODEINA FOSFATO SOLUCAO ORAL 3MGML 5ML ISCMPA @",
> >     "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
> >     "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
> >     "FUROSEMIDA SOLUCAO INJETAVEL 10MGML 2ML",
> >     "HIDROCORTISONA SUCCINATO SODICO PO LIOFILO INJETAVEL 100MG",
> >     "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
> >     "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
> >     "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
> >     "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
> >     "PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"))
> >
> > results <- df %>% filter(grepl(pattern = patterns.new, x = name,
> >                                ignore.case = TRUE))
> >
> > On Tue, Apr 6, 2021 at 02:06, Felipe Barletta <felipe.e.barle...@gmail.com> wrote:
> >
> >> Thanks a lot Gianpaolo, but your suggestion didn't work.
> >>
> >> On Mon, Apr 5, 2021 at 4:50 PM, Gianpaolo Romeo <gianpaolo.ro...@gmail.com> wrote:
> >>
> >>> I suggest you use the dplyr package:
> >>>
> >>> df %>% mutate(name = toupper(name)) %>%
> >>>   filter(grepl(pattern = patterns, name))
> >>>
> >>> If you want to search for every string that starts exactly with a
> >>> specific word:
> >>>
> >>> patterns <- paste0("^", patterns)
> >>>
> >>> On Mon, Apr 5, 2021 at 20:25, Felipe Barletta <felipe.e.barle...@gmail.com> wrote:
> >>>
> >>>> Hi friends,
> >>>>
> >>>> I need to identify medicine names in a data set.
> >>>> I have a list of antibiotic names and I need to identify those names
> >>>> in a sample.
> >>>>
> >>>> When the name of the medicine is simple, my solution worked, see:
> >>>>
> >>>> # List of medicines - object called patterns.
> >>>> patterns <- c("Oritavancina", "Oxacilina", "Pefloxacino", "Penicilina",
> >>>>               "Pexiganan", "Piperacilina", "Piperacilina-tazobactam",
> >>>>               "Pirazinamida", "Plazomicina", "Polimixina B",
> >>>>               "Posilozid")
> >>>>
> >>>> # Sample data frame where I need to find the names from the list above.
> >>>> df <- data.frame(name =
> >>>>   c("CLORETO DE POTASSIO DRAGEA 600MG",
> >>>>     "CLORETO DE SODIO 0,9% SERINGA PREENCHIDA 5ML",
> >>>>     "CLORETO DE SODIO SOLUCAO INJETAVEL 0,9% 10ML",
> >>>>     "CODEINA FOSFATO SOLUCAO ORAL 3MGML 10ML ISCMPA @",
> >>>>     "CODEINA FOSFATO SOLUCAO ORAL 3MGML 5ML ISCMPA @",
> >>>>     "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
> >>>>     "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
> >>>>     "FUROSEMIDA SOLUCAO INJETAVEL 10MGML 2ML",
> >>>>     "HIDROCORTISONA SUCCINATO SODICO PO LIOFILO INJETAVEL 100MG",
> >>>>     "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
> >>>>     "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
> >>>>     "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
> >>>>     "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
> >>>>     "PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"))
> >>>>
> >>>> results <- regex_left_join(df,
> >>>>                            patterns,
> >>>>                            by = "name")
> >>>>
> >>>> head(results)
> >>>>
> >>>> # Identify with grep() - another way.
> >>>> matches <- unlist(sapply(patterns, function(p) grep(p, df$name,
> >>>>                                                     value = FALSE,
> >>>>                                                     ignore.case = TRUE)))
> >>>>
> >>>> anti <- df[matches,]
> >>>>
> >>>> However, when the name is a compound it does not work (for example:
> >>>> Piperacillin-tazobactam).
> >>>>
> >>>> Can anyone help me with this issue?
> >>>>
> >>>> _______________________________________________
> >>>> R-sig-Epi@r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-sig-epi
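
For the compound-name case raised in the thread (a product that lists PIPERACILINA and TAZOBACTAM separately should be labelled "Piperacilina-tazobactam"), here is a minimal base-R sketch. It is not from any of the messages above: match_drug() is a hypothetical helper, and it assumes that a hyphen in a pattern separates components that may appear anywhere in the product name, and that the pattern with the most components should win when several match.

```r
# Hypothetical helper (an assumption, not code from the thread):
# for each product name, return the most specific pattern whose
# components all occur in the name, or NA when nothing matches.
match_drug <- function(names, patterns) {
  names_up <- toupper(names)
  # Split compound patterns such as "Piperacilina-tazobactam" on the hyphen
  parts <- strsplit(toupper(patterns), "-", fixed = TRUE)
  vapply(names_up, function(nm) {
    # A pattern is a hit when every one of its components is in the name
    hit <- vapply(parts, function(p) {
      all(vapply(p, function(part) grepl(part, nm, fixed = TRUE), logical(1)))
    }, logical(1))
    if (!any(hit)) return(NA_character_)
    # Prefer the hit with the most components (compound over its parts)
    best <- which(hit)[which.max(lengths(parts)[hit])]
    patterns[best]
  }, character(1), USE.NAMES = FALSE)
}

# Small subset of the thread's data, for illustration only
patterns <- c("Penicilina", "Piperacilina", "Tazobactam",
              "Piperacilina-tazobactam")
df <- data.frame(name = c(
  "CLORETO DE POTASSIO DRAGEA 600MG",
  "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
  "PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"))

# New variable holding the matched antibiotic name (NA for non-antibiotics)
df$antibiotic <- match_drug(df$name, patterns)
df
```

This keeps one row per product, unlike the regex join, which returns one row per matching pattern. Matching is by plain substring, so short components could match inside longer words; word-boundary regular expressions would be stricter.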