[
https://issues.apache.org/jira/browse/ARROW-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
youngfn updated ARROW-16867:
----------------------------
Attachment: 20220622-111516.png
> A CSV parser improvement idea
> -----------------------------
>
> Key: ARROW-16867
> URL: https://issues.apache.org/jira/browse/ARROW-16867
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 8.0.0
> Environment: Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 80
> On-line CPU(s) list: 0-79
> Thread(s) per core: 2
> Core(s) per socket: 20
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 85
> Model name: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz
> Stepping: 7
> CPU MHz: 1000.000
> CPU max MHz: 2301.0000
> CPU min MHz: 1000.0000
> BogoMIPS: 4600.00
> Virtualization: VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 1024K
> L3 cache: 28160K
> NUMA node0 CPU(s): 0-19,40-59
> NUMA node1 CPU(s): 20-39,60-79
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
> xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx
> est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe
> popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3
> intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a
> avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl
> xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
> dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d
> arch_capabilities
> Reporter: youngfn
> Priority: Major
> Attachments: 20220621-174727.png, 20220622-11065.png,
> 20220622-110658(WeLinkPC).png, 20220622-111516.png
>
>
> I ran a CSV reading test (reading a large file with more than 200
> columns, of which only four are needed) and found that the CSV parser
> accounts for most of the execution time.
> !20220621-174727.png!
> I then went through the ParseLine function and found that Arrow parses all
> columns of every row even though I want only 4 of them. I think it would
> be a great improvement if Arrow could add including_column to
> parser_option.
> Would this idea work, or is there a reason you have not done this?
> Thanks in advance.
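
To make the proposal concrete, here is a minimal sketch of what column selection at parse time could look like. It is not Arrow's ParseLine and not an existing Arrow API: the function name ParseSelectedColumns and its parameters are hypothetical, and it assumes unquoted fields, a single-character delimiter, and a sorted list of wanted column indices.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of Arrow: splits one unquoted CSV line and
// keeps only the fields whose column index appears in `wanted`.
// Assumption: `wanted` is sorted in ascending order with no duplicates.
std::vector<std::string_view> ParseSelectedColumns(
    std::string_view line, const std::vector<std::size_t>& wanted,
    char delimiter = ',') {
  std::vector<std::string_view> out;
  out.reserve(wanted.size());
  std::size_t column = 0;  // index of the field currently being scanned
  std::size_t start = 0;   // offset where the current field begins
  std::size_t next = 0;    // position in `wanted` of the next column to keep
  // Stop as soon as the last wanted column has been emitted.
  while (next < wanted.size() && start <= line.size()) {
    // Delimiters still have to be located for skipped fields.
    std::size_t end = line.find(delimiter, start);
    if (end == std::string_view::npos) end = line.size();
    if (column == wanted[next]) {
      out.push_back(line.substr(start, end - start));  // view, no copy
      ++next;
    }
    ++column;
    start = end + 1;
  }
  return out;
}
```

For example, ParseSelectedColumns("a,b,c,d,e", {1, 3}) returns the views "b" and "d" and never scans the last field. The sketch also shows the limit of the idea: skipped fields still have to be scanned for delimiters, so the savings come from not storing unwanted values and from stopping early after the last wanted column.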
--
This message was sent by Atlassian Jira
(v8.20.7#820007)