[jira] [Updated] (ARROW-16867) A CSV parser improvement idea

youngfn (Jira) Tue, 21 Jun 2022 20:16:08 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


youngfn updated ARROW-16867:
----------------------------
    Attachment: 20220622-111516.png

> A CSV parser improvement idea
> -----------------------------
>
>                 Key: ARROW-16867
>                 URL: https://issues.apache.org/jira/browse/ARROW-16867
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 8.0.0
>         Environment: Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                80
> On-line CPU(s) list:   0-79
> Thread(s) per core:    2
> Core(s) per socket:    20
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 85
> Model name:            Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz
> Stepping:              7
> CPU MHz:               1000.000
> CPU max MHz:           2301.0000
> CPU min MHz:           1000.0000
> BogoMIPS:              4600.00
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              1024K
> L3 cache:              28160K
> NUMA node0 CPU(s):     0-19,40-59
> NUMA node1 CPU(s):     20-39,60-79
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
> xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx 
> est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe 
> popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 
> intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid 
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a 
> avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl 
> xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local 
> dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d 
> arch_capabilities
>            Reporter: youngfn
>            Priority: Major
>         Attachments: 20220621-174727.png, 20220622-11065.png, 
> 20220622-110658(WeLinkPC).png, 20220622-111516.png
>
>
> While running a CSV read test (a large file with more than 200 columns, of 
> which only four are needed), I found that the CSV parser accounts for most 
> of the execution time. 
> !20220621-174727.png!
> Looking through the ParseLine function, I found that Arrow parses every 
> column of each row even though only 4 columns are requested. I think it 
> would be a significant improvement if Arrow added an including_column 
> setting to parser_option.
> Would this idea work, or is there a reason it has not been done? Thanks in 
> advance.
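
The idea in the description can be illustrated with a minimal C++ sketch. This is not Arrow's ParseLine and not a proposed patch: ParseSelectedColumns is a hypothetical helper written only for illustration, and it assumes a single line with no quoting, escaping, or embedded newlines. It keeps only the requested column indices and stops scanning once the last requested column has been read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not an Arrow API): return views of only the fields
// whose zero-based column index is listed in `wanted`.
// Assumptions: `wanted` is sorted ascending; `line` is one row with no
// quoting or escaping; the returned views point into `line`.
// If the row has fewer columns than requested, fewer fields are returned.
std::vector<std::string_view> ParseSelectedColumns(
    std::string_view line, const std::vector<std::size_t>& wanted,
    char delimiter = ',') {
  assert(std::is_sorted(wanted.begin(), wanted.end()));

  std::vector<std::string_view> out;
  out.reserve(wanted.size());

  std::size_t column = 0;  // index of the field currently being scanned
  std::size_t next = 0;    // position in `wanted` of the next column to keep
  std::size_t start = 0;   // offset of the current field's first character

  // Stop as soon as every wanted column is found or the line is exhausted.
  while (next < wanted.size() && start <= line.size()) {
    std::size_t end = line.find(delimiter, start);
    if (end == std::string_view::npos) end = line.size();

    if (column == wanted[next]) {
      // Only wanted fields are recorded; the others are skipped.
      out.push_back(line.substr(start, end - start));
      ++next;
    }
    ++column;
    start = end + 1;
  }
  return out;
}
```

Even in this sketch, skipped fields are still scanned for delimiters up to the last requested column, so the saving comes from not recording unwanted values and from stopping early. A parser that handles quoted fields and has to locate the end of each row would likely save less than this simplified version suggests.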



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
