[Jprogramming] How to improve speed searching for substrings using "E."?

vadim . Fri, 28 Jun 2019 10:12:59 -0700

Example: given a file (a string) in which each line is a list of part
numbers joined by separators, how can I exclude lines that are substrings
of other lines (compared by whole part numbers, i.e. cut on separators)?
E.g. (the first line is to be excluded):


   z =: 0 : 0
A001|B002
C003|A001|B002
B002|A001
C003|D004|A001
E005|F006
D004|C003
)
   [lines =: <;._2 z
+---------+--------------+---------+--------------+---------+---------+
|A001|B002|C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
+---------+--------------+---------+--------------+---------+---------+
   [syms =: (s:@:('|'&,))&.> lines
+-----------+-----------------+-----------+-----------------+-----------+-----------+
|`A001 `B002|`C003 `A001 `B002|`B002 `A001|`C003 `D004 `A001|`E005 `F006|`D004 `C003|
+-----------+-----------------+-----------+-----------------+-----------+-----------+
   (+./@:E.)&.(> :.])/~ syms
1 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1
   [idx =: 1&= +/"1 (+./@:E.)&.(> :.])/~ syms
0 1 1 1 1 1
   [result =: idx#lines
+--------------+---------+--------------+---------+---------+
|C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
+--------------+---------+--------------+---------+---------+
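
To make the pairwise test concrete, here is a minimal sketch of what the table computes for a single pair of lines. The names a and b are hypothetical and exist only for this illustration; they hold the first two sample lines converted the same way as syms. Matching on symbols rather than characters is presumably what "cut on separators" is meant to achieve, since a part number can then only match a whole part number.

```j
NB. Hypothetical names a and b: the first two sample lines as symbol lists.
NB. s: applied to a string treats its first character as the separator.
a =: s: '|A001|B002'
b =: s: '|C003|A001|B002'

a E. b          NB. 0 1 0 : a begins at item 1 of b
a +./@:E. b     NB. 1     : a occurs somewhere in b
b +./@:E. a     NB. 0     : the longer list cannot occur in the shorter one
```

The table form applies this same test to every pair of lines, which is where the row and column structure of the result below comes from.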

I'm worried about the performance of this line:

(+./@:E.)&.(> :.])/~ syms

Other details are not very important, as I'm only learning. The phrase
above uses the form "+./@:E.", which the J Wiki recommends for speed. I
assume the table adverb is optimized, too. But with a heavier input:

   z =: 0 : 0
2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803
3419308|3514531|3525716|3557019|3586192|3635776|3783741
3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854
3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721
3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934
4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377
4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788
6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803
6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934
6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377
6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152
)
   lines =: <;._2 z
   syms =: (s:@:('|'&,))&.> lines
   syms10 =: ,(i.10) ]"0 _ syms
   syms100 =: ,(i.100) ]"0 _ syms
   syms1000 =: ,(i.1000) ]"0 _ syms
   10 (6!:2) '(+./@:E.)&.(> :.])/~ syms'
3.936e_5
   10 (6!:2) '(+./@:E.)&.(> :.])/~ syms10'
0.00652701
   10 (6!:2) '(+./@:E.)&.(> :.])/~ syms100'
0.283609
   1 (6!:2) '(+./@:E.)&.(> :.])/~ syms1000'
28.7405

28 seconds for 11000 short lines is unacceptable. Am I doing something
totally wrong? For comparison, Perl shows close to linear (definitely
not quadratic) scaling on this task and runs hundreds to thousands of
times faster.
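
As a rough check on the quadratic behaviour, the sketch below recomputes the growth from the timings reported above. The names t and n are hypothetical; n assumes 11 lines per copy of the sample, so syms1000 holds 11000 lines.

```j
NB. t: the four timings reported above, in seconds.
t =: 3.936e_5 0.00652701 0.283609 28.7405
NB. n: line counts of syms, syms10, syms100 and syms1000 (11 lines per copy).
n =: 11 * 1 10 100 1000

*: n            NB. 121 12100 1210000 121000000 pairs tested by the table
2 %~/\ t        NB. ratio of each timing to the one before it
```

The ratios come out at roughly 166, 43 and 101. The last step, 10 times more lines for about 100 times more time, is what an all-pairs table would predict; the smaller timings are probably too short to be reliable.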
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
